In this blog post, we will explore how to seamlessly ingest data from Amazon S3 into Databricks using Auto Loader. We will also discuss performing transformations on the data and implementing a Medallion architecture for better management and processing of large datasets.

What is the Medallion Architecture?

The Medallion architecture is a data modeling pattern commonly used in data engineering to organize data into three layers:

  1. Bronze Layer: Raw data ingested from various sources.
  2. Silver Layer: Processed data that undergoes cleaning and transformations.
  3. Gold Layer: Data that is highly refined for analytics and reporting.

This architecture helps streamline the data processing workflow and enhances the quality of insights drawn from the data.

Prerequisites

Before diving into the setup, ensure you have the following:

  • AWS S3 Bucket: Data should be available in an S3 bucket.
  • Databricks Account: An active Databricks workspace.
  • Databricks Runtime: Ensure your Databricks cluster is using a runtime version that supports Auto Loader.

Step 1: Setting Up Your Environment

  1. Create an AWS S3 Bucket: If you don’t have one, log in to your AWS account and create an S3 bucket. Upload your data files (in formats like CSV, JSON, Parquet, etc.) to this bucket.
  2. Configure IAM Roles: Ensure that your Databricks cluster has the necessary permissions to access the S3 bucket. You may need to create an IAM role and attach a policy that grants read access to the bucket, as in the example below.
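
A minimal sketch of such a policy might look like the following, assuming your bucket is named your-bucket-name (replace with your own):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}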

Step 2: Ingesting Data with Auto Loader

Auto Loader is a powerful feature in Databricks that allows you to efficiently and incrementally process files as they arrive in your S3 bucket.

from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession.builder.appName("S3 to Databricks").getOrCreate()

# Define the path to the S3 bucket
s3_path = "s3://your-bucket-name/path-to-data/"

# Read data incrementally using Auto Loader
df_bronze = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")  # Change to your file format
             .option("cloudFiles.schemaLocation", "s3://your-bucket-name/schemas/bronze/")  # Required so Auto Loader can track the inferred schema
             .load(s3_path))
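
At this point you would typically persist the raw stream to a Bronze Delta table before transforming it. A minimal sketch, using placeholder checkpoint and table names to replace with your own:

# Persist the raw stream as a Bronze Delta table
# (the checkpoint path and table name below are placeholders)
(df_bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://your-bucket-name/checkpoints/bronze/")
    .outputMode("append")
    .toTable("bronze_table_name"))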

Step 3: Transforming Data in the Silver Layer

Once the data is ingested into the Bronze layer, you can perform various transformations to clean and enrich the data.

# Example transformation: Filtering and selecting specific columns
df_silver = (df_bronze
             .filter("your_filter_condition")  # Add your filter condition
             .select("column1", "column2", "column3"))  # Select relevant columns

Write the transformed data to a Silver table:

df_silver.writeStream \
    .format("delta") \
    .option("checkpointLocation", "s3://your-bucket-name/checkpoints/silver/") \
    .outputMode("append") \
    .toTable("silver_table_name")

Step 4: Building the Gold Layer

The final step is to create the Gold layer, where you prepare high-quality, aggregated datasets for analytics. Because the aggregation below runs on a streaming DataFrame, its results must be written in complete output mode, which rewrites the full aggregation result on each update.

# Aggregating data to create a Gold layer
df_gold = df_silver.groupBy("column1").agg({"column2": "avg", "column3": "count"})

df_gold.writeStream \
    .format("delta") \
    .option("checkpointLocation", "s3://your-bucket-name/checkpoints/gold/") \
    .outputMode("complete") \
    .toTable("gold_table_name")

Conclusion

By following these steps, you’ve successfully ingested data from AWS S3 into Databricks using Auto Loader and created a Medallion architecture with Bronze, Silver, and Gold layers. This structured approach not only optimizes data processing but also enhances the quality of insights derived from your data.

Feel free to implement these strategies in your data projects for efficient data management and meaningful analytics. Happy data engineering!
