In modern data pipelines, automation is key. A common requirement is to automatically trigger an AWS Glue ETL job whenever new data is uploaded to Amazon S3.

AWS provides an elegant, fully serverless way to achieve this using Amazon EventBridge and AWS Lambda.

In this blog, we’ll walk through a real-time data pipeline that:

  • Detects file uploads to Amazon S3
  • Sends events via EventBridge
  • Triggers a Lambda function
  • Starts an AWS Glue job

Let’s dive into the architecture and implementation 🚀


🏗️ Architecture Overview

S3 → EventBridge → Lambda → Glue Job


🔁 End-to-End Flow

  1. Data Upload to S3
    A new file lands in a specific S3 bucket.
  2. EventBridge Rule
    Captures the Object Created event from S3 and forwards it to the Lambda target.
  3. Lambda Function
    Parses the event and starts the corresponding Glue job.
  4. AWS Glue Job
    Performs the ETL (Extract, Transform, Load) processing.

📦 Prerequisites

Before you begin, ensure you have the following:

  • An AWS S3 bucket
  • An AWS Glue job created with a valid script
  • A Lambda execution role with permission to start Glue jobs
  • An EventBridge rule linked to S3 events
  • A Lambda function using Python

🔧 Step-by-Step Implementation


1️⃣ Create an S3 Bucket and Upload Sample Data

Upload your files to a folder path such as:

s3://your-bucket/input-data/

💡 You don't need to configure per-prefix S3 Event Notifications for this pattern, but you do need to enable EventBridge delivery on the bucket: in the bucket's Properties, under Amazon EventBridge, turn on "Send notifications to Amazon EventBridge for all events in this bucket" (or enable it programmatically, as sketched below).
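
If you prefer to script this step, here's a minimal boto3 sketch of that toggle. The bucket name your-bucket is a placeholder; replace it with your own.

import boto3

s3 = boto3.client('s3')

# Turn on "Send notifications to Amazon EventBridge" for the bucket.
# An empty EventBridgeConfiguration block enables the feature.
# Note: this call replaces any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket='your-bucket',
    NotificationConfiguration={'EventBridgeConfiguration': {}}
)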


2️⃣ Create an EventBridge Rule

Navigate to:

Amazon EventBridge → Rules → Create rule

Rule Details

  • Name: S3ToGlueTriggerRule
  • Event pattern:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["your-bucket"]
    },
    "object": {
      "key": [{
        "prefix": "input-data/"
      }]
    }
  }
}

Target

  • Target type: Lambda function
  • Target: (Select the Lambda function created in the next step)
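
The same wiring can also be done programmatically. Below is a rough boto3 sketch, assuming the Lambda function from the next step is named s3-to-glue-trigger (a placeholder, as are the account and region in the ARN): it creates the rule with the pattern above, adds the Lambda target, and grants EventBridge permission to invoke the function (the console adds that permission for you automatically).

import json
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["your-bucket"]},
        "object": {"key": [{"prefix": "input-data/"}]}
    }
}

# Create (or update) the rule with the event pattern shown above
rule_arn = events.put_rule(
    Name='S3ToGlueTriggerRule',
    EventPattern=json.dumps(event_pattern),
    State='ENABLED'
)['RuleArn']

# Point the rule at the Lambda function (placeholder ARN)
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:s3-to-glue-trigger'
events.put_targets(
    Rule='S3ToGlueTriggerRule',
    Targets=[{'Id': 'TriggerGlueLambda', 'Arn': function_arn}]
)

# Allow EventBridge to invoke the function (the console normally does this for you)
lambda_client.add_permission(
    FunctionName='s3-to-glue-trigger',
    StatementId='AllowEventBridgeInvoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)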

3️⃣ Create the Lambda Function

This Lambda function will listen to S3 events and trigger the Glue job.

🐍 Lambda Python Code

import boto3
import json
import os

glue = boto3.client('glue')

def lambda_handler(event, context):
    print("Received Event:", json.dumps(event))

    # Extract S3 bucket and object key
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    glue_job_name = os.environ['GLUE_JOB_NAME']  # Environment variable

    try:
        response = glue.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--bucket': bucket,
                '--key': key
            }
        )
        print("Glue Job Triggered:", response['JobRunId'])
        return {
            'statusCode': 200,
            'body': f"Triggered Glue Job {glue_job_name} with run ID {response['JobRunId']}"
        }
    except Exception as e:
        print("Error triggering Glue Job:", e)
        return {
            'statusCode': 500,
            'body': f"Error: {str(e)}"
        }


✅ Lambda Configuration

  • Runtime: Python 3.9 or above
  • Environment Variable: GLUE_JOB_NAME = your-glue-job-name
  • Permissions:
    • glue:StartJobRun
    • CloudWatch Logs permissions

⚠️ Avoid using full admin policies in production—use least privilege.
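
Before wiring everything together, you can sanity-check the function with a synthetic "Object Created" event. The sketch below keeps only the fields the handler actually reads; the function name s3-to-glue-trigger and the object key are placeholders.

import json
import boto3

lambda_client = boto3.client('lambda')

# Minimal synthetic EventBridge event with just the fields the handler uses
test_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "your-bucket"},
        "object": {"key": "input-data/sales_data.csv"}
    }
}

response = lambda_client.invoke(
    FunctionName='s3-to-glue-trigger',  # placeholder function name
    Payload=json.dumps(test_event).encode('utf-8')
)
print(response['Payload'].read().decode('utf-8'))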


4️⃣ Create the AWS Glue Job

Below is a basic PySpark Glue script that reads CSV data from S3, performs a simple transformation, and writes the output back to S3.


🔥 AWS Glue PySpark Script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.job import Job

# Get parameters passed from Lambda
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

bucket = args['bucket']
key = args['key']

input_path = f"s3://{bucket}/{key}"
output_path = f"s3://{bucket}/processed-data/"

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data
df = spark.read.option("header", True).csv(input_path)

# Sample transformation: drop null rows
df_clean = df.na.drop()

# Write output back to S3
df_clean.write.mode("overwrite").parquet(output_path)

job.commit()
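
The job itself can also be registered from code instead of the console. Here's a rough boto3 sketch: the job name matches the GLUE_JOB_NAME environment variable used by the Lambda, while the role ARN, script location, and worker settings are placeholders to adapt. Note that the Glue job's role is separate from the Lambda execution role and needs read/write access to the S3 bucket.

import boto3

glue = boto3.client('glue')

# Register the job; the PySpark script above must already be uploaded to S3
glue.create_job(
    Name='your-glue-job-name',
    Role='arn:aws:iam::123456789012:role/GlueETLJobRole',  # placeholder role ARN
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-bucket/scripts/etl_job.py',  # placeholder path
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    NumberOfWorkers=2,
    WorkerType='G.1X'
)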


🔐 Required IAM Permissions

IAM Policy for Lambda Execution Role

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
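
To attach these permissions from code, one option is an inline policy on the Lambda execution role. The role name below is a placeholder; the policy document is the same one shown above.

import json
import boto3

iam = boto3.client('iam')

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["glue:StartJobRun", "logs:CreateLogGroup",
                   "logs:CreateLogStream", "logs:PutLogEvents"],
        "Resource": "*"
    }]
}

# Attach the policy inline to the Lambda execution role (placeholder role name)
iam.put_role_policy(
    RoleName='s3-to-glue-trigger-role',
    PolicyName='StartGlueJobPolicy',
    PolicyDocument=json.dumps(policy_document)
)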


🔍 Testing the Workflow

  1. Upload a file (e.g., sales_data.csv) to: s3://your-bucket/input-data/
  2. Verify:
    • Lambda logs in CloudWatch
    • Glue job execution in AWS Glue Console
    • Output files in: s3://your-bucket/processed-data/
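
If you'd rather verify from a script than click through the console, a quick boto3 check like this (job name is a placeholder) lists the most recent runs and their status:

import boto3

glue = boto3.client('glue')

# Show the latest runs for the job triggered by the upload
runs = glue.get_job_runs(JobName='your-glue-job-name', MaxResults=5)
for run in runs['JobRuns']:
    print(run['Id'], run['JobRunState'], run.get('StartedOn'))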

🧠 Final Thoughts

This serverless event-driven architecture enables fully automated ETL pipelines in AWS with minimal operational overhead.

By combining S3 + EventBridge + Lambda + Glue, you can:

  • Trigger ETL jobs in real time
  • Eliminate inefficient polling
  • Build scalable, modular, and cost-effective data pipelines

If you’re designing modern data platforms on AWS, this pattern is a must-have in your toolkit ✅