In modern data pipelines, automation is key. A common requirement is to automatically trigger a Glue ETL job every time new data lands in Amazon S3. AWS provides an elegant, serverless way to achieve this using EventBridge and Lambda. In this blog, we’ll walk through a real-time data pipeline that:

  • Detects file uploads to S3
  • Sends events via EventBridge
  • Triggers a Lambda function
  • Starts an AWS Glue job

Let’s dive into the architecture and implementation.

🏗️ Architecture Overview

S3 → EventBridge → Lambda → Glue Job

🔁 Flow:

  1. Data Upload to S3: A new file lands in a specific S3 bucket.
  2. EventBridge Rule: Matches the Object Created event emitted by S3 and routes it to a Lambda target.
  3. Lambda Function: Parses the event and starts the corresponding Glue job.
  4. Glue Job: Performs the ETL (Extract, Transform, Load) task.

📦 Prerequisites

  • AWS S3 bucket
  • AWS Glue job created (with a valid script)
  • Lambda execution role with permission to start Glue jobs
  • EventBridge rule linked to S3 events
  • Lambda function with Python code

🔧 Step-by-Step Implementation

1️⃣ Create an S3 Bucket and Upload Sample Data

Upload your files to a folder path like:

s3://your-bucket/input-data/

In the bucket's Properties → Event notifications, enable "Send notifications to Amazon EventBridge"; without this setting, object-level events from S3 never reach EventBridge.
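
If you prefer to script this step, a short boto3 call flips the same switch on the bucket. This is a minimal sketch; your-bucket is a placeholder for your bucket name:

import boto3

s3 = boto3.client('s3')

# Turn on "Send notifications to Amazon EventBridge" for the bucket so that
# object-level events (such as Object Created) are delivered to EventBridge.
s3.put_bucket_notification_configuration(
    Bucket='your-bucket',  # placeholder: replace with your bucket name
    NotificationConfiguration={'EventBridgeConfiguration': {}}
)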

2️⃣ Create an EventBridge Rule

Go to EventBridge → Rules → Create Rule

  • Name: S3ToGlueTriggerRule
  • Event Pattern:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["your-bucket"]
    },
    "object": {
      "key": [{
        "prefix": "input-data/"
      }]
    }
  }
}
  • Target: Lambda Function (we’ll create it next)

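If you manage infrastructure from code rather than the console, the same rule and target can be created with boto3. The sketch below assumes the Lambda function from the next step already exists; the Lambda ARN is a placeholder:

import boto3
import json

events = boto3.client('events')
lambda_client = boto3.client('lambda')

lambda_arn = 'arn:aws:lambda:us-east-1:123456789012:function:StartGlueJob'  # placeholder ARN

event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["your-bucket"]},
        "object": {"key": [{"prefix": "input-data/"}]}
    }
}

# Create (or update) the rule with the event pattern shown above
rule_arn = events.put_rule(
    Name='S3ToGlueTriggerRule',
    EventPattern=json.dumps(event_pattern)
)['RuleArn']

# Point the rule at the Lambda function
events.put_targets(
    Rule='S3ToGlueTriggerRule',
    Targets=[{'Id': 'start-glue-lambda', 'Arn': lambda_arn}]
)

# Allow EventBridge to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=lambda_arn,
    StatementId='AllowEventBridgeInvoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)
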
3️⃣ Create the Lambda Function

Use the following Python code to trigger a Glue job:

🐍 Lambda Python Code

import boto3
import json
import os

glue = boto3.client('glue')

def lambda_handler(event, context):
    print("Received Event:", json.dumps(event))
    
    # Extract S3 bucket and object key
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']
    
    glue_job_name = os.environ['GLUE_JOB_NAME']  # Get job name from environment variable

    try:
        response = glue.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--bucket': bucket,  # S3 bucket name parsed from the event
                '--key': key  # Object key (path) parsed from the event
            }
        )
        print("Glue Job Triggered:", response['JobRunId'])
        return {
            'statusCode': 200,
            'body': f"Triggered Glue Job {glue_job_name} with run ID {response['JobRunId']}"
        }
    except Exception as e:
        print("Error triggering Glue Job:", e)
        return {
            'statusCode': 500,
            'body': f"Error: {str(e)}"
        }
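
Before wiring everything together, you can sanity-check the handler locally with a minimal event that mimics the shape of the EventBridge payload. This is a sketch; the bucket name and key are placeholders, and it requires GLUE_JOB_NAME in your environment plus credentials that allow glue:StartJobRun:

# Minimal stand-in for the EventBridge "Object Created" event
sample_event = {
    "detail": {
        "bucket": {"name": "your-bucket"},
        "object": {"key": "input-data/sales_data.csv"}
    }
}

# Invoke the handler directly; prints the handler's response dict
print(lambda_handler(sample_event, None))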

✅ Lambda Settings

  • Runtime: Python 3.9 or above
  • Add environment variable: GLUE_JOB_NAME = your-glue-job-name
  • Permissions: Attach a custom IAM policy that allows glue:StartJobRun (shown in the Permissions section below); the managed AWSGlueConsoleFullAccess policy also works, but it grants far more access than this workflow needs

4️⃣ Create the AWS Glue Job

Here’s a basic PySpark Glue script:

🔥 AWS Glue PySpark Code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.job import Job

# Resolve the parameters passed from Lambda; getResolvedOptions strips the leading '--'
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

bucket = args['bucket']
key = args['key']

input_path = f"s3://{bucket}/{key}"
output_path = f"s3://{bucket}/processed-data/"

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data
df = spark.read.option("header", True).csv(input_path)

# Sample transformation: drop rows that contain null values
df_clean = df.na.drop()

# Write back to S3
df_clean.write.mode("overwrite").parquet(output_path)

job.commit()

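If you create the job from code instead of the Glue console, a minimal boto3 sketch looks like the following. The role ARN, script location, and Glue version are assumptions; adjust them for your account:

import boto3

glue = boto3.client('glue')

# Minimal job definition; ScriptLocation points at the PySpark script above
glue.create_job(
    Name='your-glue-job-name',  # must match the GLUE_JOB_NAME variable on the Lambda
    Role='arn:aws:iam::123456789012:role/GlueJobRole',  # placeholder role with Glue + S3 access
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-bucket/scripts/etl_job.py',  # placeholder script path
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    NumberOfWorkers=2,
    WorkerType='G.1X'
)
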
✅ Permissions Required

🔐 IAM Role for Lambda

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
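
Attaching this inline policy to the Lambda execution role can also be scripted. A quick boto3 sketch, assuming the role and policy names below (both placeholders) and that policy holds the JSON document above:

import boto3
import json

iam = boto3.client('iam')

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

# Attach the statement above as an inline policy on the Lambda execution role
iam.put_role_policy(
    RoleName='lambda-glue-trigger-role',  # placeholder execution role name
    PolicyName='StartGlueJobRun',
    PolicyDocument=json.dumps(policy)
)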

🔍 Testing the Workflow

  1. Upload a file (e.g., sales_data.csv) to the S3 path s3://your-bucket/input-data/
  2. Check Lambda logs for Glue trigger
  3. Verify Glue job run in AWS Glue console
  4. Check output in s3://your-bucket/processed-data/
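
If you would rather verify the run from a script than from the console, the JobRunId printed by the Lambda can be checked with boto3. The job name and run ID below are placeholders:

import boto3

glue = boto3.client('glue')

# Look up the state of a specific run using the ID logged by the Lambda
run = glue.get_job_run(
    JobName='your-glue-job-name',
    RunId='jr_0123456789abcdef'  # placeholder run ID from the Lambda logs
)
print(run['JobRun']['JobRunState'])  # e.g. RUNNING, SUCCEEDED, FAILED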

🧠 Final Thoughts

This serverless setup allows you to build automated data pipelines in AWS with minimal operational overhead. Using S3 + EventBridge + Lambda + Glue, you can:

  • Trigger ETL jobs in real time
  • Eliminate polling
  • Maintain modular and scalable workflows