In modern data pipelines, automation is key. A common requirement is to automatically trigger a Glue ETL job every time new data lands in Amazon S3. AWS provides an elegant, serverless way to achieve this using EventBridge and Lambda. In this blog, we’ll walk through a real-time data pipeline that:
- Detects file uploads to S3
- Sends events via EventBridge
- Triggers a Lambda function
- Starts an AWS Glue job
Let’s dive into the architecture and implementation.
🏗️ Architecture Overview
S3 → EventBridge → Lambda → Glue Job
🔁 Flow:
- Data Upload to S3: A new file lands in a specific S3 bucket.
- EventBridge Rule: Captures the Object Created event from S3 and routes it to a Lambda target.
- Lambda Function: Parses the event and starts the corresponding Glue job.
- Glue Job: Performs the ETL (Extract, Transform, Load) task.
📦 Prerequisites
- AWS S3 bucket
- AWS Glue job created (with a valid script)
- Lambda execution role with permission to start Glue jobs
- EventBridge rule linked to S3 events
- Lambda function with Python code
🔧 Step-by-Step Implementation
1️⃣ Create an S3 Bucket and Upload Sample Data
Upload your files to a folder path like:
s3://your-bucket/input-data/
In the bucket's Properties tab, enable "Send notifications to Amazon EventBridge". This step is required: EventBridge only receives Object Created events from buckets that have it turned on. The sketch below shows the same setting applied from code.
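If you script your buckets, the same toggle can be applied with boto3. A minimal sketch, where your-bucket is a placeholder for your actual bucket name:

import boto3

s3 = boto3.client("s3")

# Turn on "Send notifications to Amazon EventBridge" for the bucket.
# An empty EventBridgeConfiguration block is all that is required.
s3.put_bucket_notification_configuration(
    Bucket="your-bucket",  # placeholder bucket name
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)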
2️⃣ Create an EventBridge Rule
Go to EventBridge → Rules → Create Rule
- Name: S3ToGlueTriggerRule
- Event Pattern:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["your-bucket"]
    },
    "object": {
      "key": [{
        "prefix": "input-data/"
      }]
    }
  }
}
- Target: Lambda Function (we’ll create it next)
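If you manage the rule from code instead of the console, a minimal boto3 sketch follows. It assumes the Lambda function from the next step already exists; the region, account ID, and names are placeholders:

import json

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["your-bucket"]},
        "object": {"key": [{"prefix": "input-data/"}]},
    },
}

# Create (or update) the rule with the S3 event pattern.
events.put_rule(Name="S3ToGlueTriggerRule", EventPattern=json.dumps(pattern))

# Point the rule at the Lambda function (placeholder ARN).
lambda_arn = "arn:aws:lambda:us-east-1:123456789012:function:s3-to-glue-trigger"
events.put_targets(
    Rule="S3ToGlueTriggerRule",
    Targets=[{"Id": "s3-to-glue-lambda", "Arn": lambda_arn}],
)

# Allow EventBridge to invoke the function (placeholder names/ARNs).
lambda_client.add_permission(
    FunctionName="s3-to-glue-trigger",
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn="arn:aws:events:us-east-1:123456789012:rule/S3ToGlueTriggerRule",
)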
3️⃣ Create the Lambda Function
Use the following Python code to trigger a Glue job:
🐍 Lambda Python Code
import boto3
import json
import os

glue = boto3.client('glue')

def lambda_handler(event, context):
    print("Received Event:", json.dumps(event))

    # Extract the S3 bucket and object key from the EventBridge event
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    glue_job_name = os.environ['GLUE_JOB_NAME']  # Get job name from environment variable

    try:
        response = glue.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--bucket': bucket,  # Bucket that fired the event
                '--key': key         # Object key (path) that fired the event
            }
        )
        print("Glue Job Triggered:", response['JobRunId'])
        return {
            'statusCode': 200,
            'body': f"Triggered Glue Job {glue_job_name} with run ID {response['JobRunId']}"
        }
    except Exception as e:
        print("Error triggering Glue Job:", e)
        return {
            'statusCode': 500,
            'body': f"Error: {str(e)}"
        }
✅ Lambda Settings
- Runtime: Python 3.9 or above
- Add environment variable: GLUE_JOB_NAME = your-glue-job-name
- Permissions: Attach the AWSGlueConsoleFullAccess managed policy, or a custom policy with glue:StartJobRun (see Permissions Required below)
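You can sanity-check the handler locally before wiring up EventBridge. A small sketch, assuming the code above is saved as lambda_function.py and valid AWS credentials are configured; the file name in the event is a placeholder, and note this really calls glue:StartJobRun:

import os

os.environ["GLUE_JOB_NAME"] = "your-glue-job-name"  # placeholder job name

from lambda_function import lambda_handler  # assumes the handler file name

# Minimal shape of an EventBridge "Object Created" event: only the
# fields the handler actually reads.
fake_event = {
    "detail": {
        "bucket": {"name": "your-bucket"},
        "object": {"key": "input-data/sales_data.csv"},
    }
}

print(lambda_handler(fake_event, None))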
4️⃣ Create the AWS Glue Job
Here’s a basic PySpark Glue script:
🔥 AWS Glue PySpark Code
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.job import Job
# Get parameters passed from Lambda
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])
bucket = args['bucket']
key = args['key']
input_path = f"s3://{bucket}/{key}"
output_path = f"s3://{bucket}/processed-data/"
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read data
df = spark.read.option("header", True).csv(input_path)
# Sample transformation: filter non-null rows
df_clean = df.na.drop()
# Write back to S3
df_clean.write.mode("overwrite").parquet(output_path)
job.commit()
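If the job doesn't exist yet, you can register it from code as well. A minimal boto3 sketch; the job name, role ARN, and script location are placeholders, and the worker settings are just an example:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="your-glue-job-name",  # placeholder: must match GLUE_JOB_NAME in Lambda
    Role="arn:aws:iam::123456789012:role/your-glue-job-role",  # placeholder role ARN
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/your_glue_script.py",  # placeholder
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)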
✅ Permissions Required
🔐 IAM Role for Lambda
{
  "Effect": "Allow",
  "Action": [
    "glue:StartJobRun",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents"
  ],
  "Resource": "*"
}
This is a policy statement, so wrap it in the usual Version/Statement envelope when you attach it. Note that the Glue job runs under its own IAM role, which separately needs read access to the input prefix and write access to the output prefix in S3.
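To attach the statement from code rather than the console, one option is an inline policy on the Lambda execution role. A sketch, with placeholder role and policy names:

import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "glue:StartJobRun",
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
        ],
        "Resource": "*",
    }],
}

iam.put_role_policy(
    RoleName="your-lambda-execution-role",  # placeholder role name
    PolicyName="StartGlueJobFromLambda",    # placeholder policy name
    PolicyDocument=json.dumps(policy),
)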
🔍 Testing the Workflow
- Upload a file (e.g., sales_data.csv) to the S3 path s3://your-bucket/input-data/
- Check the Lambda logs in CloudWatch for the Glue trigger
- Verify the Glue job run in the AWS Glue console (or from code, as in the sketch below)
- Check the output in s3://your-bucket/processed-data/
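To check the run without opening the console, a small sketch (the job name is a placeholder):

import boto3

glue = boto3.client("glue")

# Fetch the most recent run for the job and print its state.
runs = glue.get_job_runs(JobName="your-glue-job-name", MaxResults=1)  # placeholder
latest = runs["JobRuns"][0]
print(latest["Id"], latest["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED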
🧠 Final Thoughts
This serverless setup allows you to build automated data pipelines in AWS with minimal operational overhead. Using S3 + EventBridge + Lambda + Glue, you can:
- Trigger ETL jobs in real time
- Eliminate polling
- Maintain modular and scalable workflows