In modern data pipelines, automation is key. A common requirement is to automatically trigger a Glue ETL job every time new data lands in Amazon S3. AWS provides an elegant, serverless way to achieve this using EventBridge and Lambda. In this blog, we’ll walk through a real-time data pipeline that:

  • Detects file uploads to S3
  • Sends events via EventBridge
  • Triggers a Lambda function
  • Starts an AWS Glue job

Let’s dive into the architecture and implementation.

🏗️ Architecture Overview

S3 → EventBridge → Lambda → Glue Job

🔁 Flow:

  1. Data Upload to S3: A new file lands in a specific S3 bucket.
  2. EventBridge Rule: Matches the Object Created event emitted by S3 and routes it to a Lambda target.
  3. Lambda Function: Parses the event and starts the corresponding Glue job.
  4. Glue Job: Performs the ETL (Extract, Transform, Load) task.

📦 Prerequisites

  • AWS S3 bucket
  • AWS Glue job created (with a valid script)
  • Lambda execution role with permission to start Glue jobs
  • EventBridge rule linked to S3 events
  • Lambda function with Python code

🔧 Step-by-Step Implementation

1️⃣ Create an S3 Bucket and Upload Sample Data

Upload your files to a folder path like:

s3://your-bucket/input-data/

In the bucket's Properties → Event notifications, enable "Send notifications to Amazon EventBridge"; without this setting, object-level events from S3 never reach EventBridge.
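
If you prefer to script this step, a short boto3 call flips the same switch on the bucket. This is a minimal sketch; your-bucket is a placeholder for your bucket name:

import boto3

s3 = boto3.client('s3')

# Turn on "Send notifications to Amazon EventBridge" for the bucket so that
# object-level events (such as Object Created) are delivered to EventBridge.
s3.put_bucket_notification_configuration(
    Bucket='your-bucket',  # placeholder: replace with your bucket name
    NotificationConfiguration={'EventBridgeConfiguration': {}}
)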

2️⃣ Create an EventBridge Rule

Go to EventBridge → Rules → Create Rule

  • Name: S3ToGlueTriggerRule
  • Event Pattern:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["your-bucket"]
    },
    "object": {
      "key": [{
        "prefix": "input-data/"
      }]
    }
  }
}
  • Target: Lambda Function (we’ll create it next)

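If you manage infrastructure from code rather than the console, the same rule and target can be created with boto3. The sketch below assumes the Lambda function from the next step already exists; the Lambda ARN is a placeholder:

import boto3
import json

events = boto3.client('events')
lambda_client = boto3.client('lambda')

lambda_arn = 'arn:aws:lambda:us-east-1:123456789012:function:StartGlueJob'  # placeholder ARN

event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["your-bucket"]},
        "object": {"key": [{"prefix": "input-data/"}]}
    }
}

# Create (or update) the rule with the event pattern shown above
rule_arn = events.put_rule(
    Name='S3ToGlueTriggerRule',
    EventPattern=json.dumps(event_pattern)
)['RuleArn']

# Point the rule at the Lambda function
events.put_targets(
    Rule='S3ToGlueTriggerRule',
    Targets=[{'Id': 'start-glue-lambda', 'Arn': lambda_arn}]
)

# Allow EventBridge to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=lambda_arn,
    StatementId='AllowEventBridgeInvoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)
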
3️⃣ Create the Lambda Function

Use the following Python code to trigger a Glue job:

🐍 Lambda Python Code

import boto3
import json
import os

glue = boto3.client('glue')

def lambda_handler(event, context):
    print("Received Event:", json.dumps(event))
    
    # Extract S3 bucket and object key
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']
    
    glue_job_name = os.environ['GLUE_JOB_NAME']  # Get job name from environment variable

    try:
        response = glue.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--bucket': bucket,  # S3 bucket name parsed from the event
                '--key': key  # Object key (path) parsed from the event
            }
        )
        print("Glue Job Triggered:", response['JobRunId'])
        return {
            'statusCode': 200,
            'body': f"Triggered Glue Job {glue_job_name} with run ID {response['JobRunId']}"
        }
    except Exception as e:
        print("Error triggering Glue Job:", e)
        return {
            'statusCode': 500,
            'body': f"Error: {str(e)}"
        }
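
Before wiring everything together, you can sanity-check the handler locally with a minimal event that mimics the shape of the EventBridge payload. This is a sketch; the bucket name and key are placeholders, and it requires GLUE_JOB_NAME in your environment plus credentials that allow glue:StartJobRun:

# Minimal stand-in for the EventBridge "Object Created" event
sample_event = {
    "detail": {
        "bucket": {"name": "your-bucket"},
        "object": {"key": "input-data/sales_data.csv"}
    }
}

# Invoke the handler directly; prints the handler's response dict
print(lambda_handler(sample_event, None))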

✅ Lambda Settings

  • Runtime: Python 3.9 or above
  • Add environment variable: GLUE_JOB_NAME = your-glue-job-name
  • Permissions: Attach a custom IAM policy that allows glue:StartJobRun (shown in the Permissions section below); the managed AWSGlueConsoleFullAccess policy also works, but it grants far more access than this workflow needs

4️⃣ Create the AWS Glue Job

Here’s a basic PySpark Glue script:

🔥 AWS Glue PySpark Code

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.job import Job

# Resolve the parameters passed from Lambda; getResolvedOptions strips the leading '--'
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

bucket = args['bucket']
key = args['key']

input_path = f"s3://{bucket}/{key}"
output_path = f"s3://{bucket}/processed-data/"

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data
df = spark.read.option("header", True).csv(input_path)

# Sample transformation: drop rows that contain null values
df_clean = df.na.drop()

# Write back to S3
df_clean.write.mode("overwrite").parquet(output_path)

job.commit()

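If you create the job from code instead of the Glue console, a minimal boto3 sketch looks like the following. The role ARN, script location, and Glue version are assumptions; adjust them for your account:

import boto3

glue = boto3.client('glue')

# Minimal job definition; ScriptLocation points at the PySpark script above
glue.create_job(
    Name='your-glue-job-name',  # must match the GLUE_JOB_NAME variable on the Lambda
    Role='arn:aws:iam::123456789012:role/GlueJobRole',  # placeholder role with Glue + S3 access
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-bucket/scripts/etl_job.py',  # placeholder script path
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    NumberOfWorkers=2,
    WorkerType='G.1X'
)
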
✅ Permissions Required

🔐 IAM Role for Lambda

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
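
Attaching this inline policy to the Lambda execution role can also be scripted. A quick boto3 sketch, assuming the role and policy names below (both placeholders) and that policy holds the JSON document above:

import boto3
import json

iam = boto3.client('iam')

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

# Attach the statement above as an inline policy on the Lambda execution role
iam.put_role_policy(
    RoleName='lambda-glue-trigger-role',  # placeholder execution role name
    PolicyName='StartGlueJobRun',
    PolicyDocument=json.dumps(policy)
)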

🔍 Testing the Workflow

  1. Upload a file (e.g., sales_data.csv) to the S3 path s3://your-bucket/input-data/
  2. Check Lambda logs for Glue trigger
  3. Verify Glue job run in AWS Glue console
  4. Check output in s3://your-bucket/processed-data/
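
If you would rather verify the run from a script than from the console, the JobRunId printed by the Lambda can be checked with boto3. The job name and run ID below are placeholders:

import boto3

glue = boto3.client('glue')

# Look up the state of a specific run using the ID logged by the Lambda
run = glue.get_job_run(
    JobName='your-glue-job-name',
    RunId='jr_0123456789abcdef'  # placeholder run ID from the Lambda logs
)
print(run['JobRun']['JobRunState'])  # e.g. RUNNING, SUCCEEDED, FAILED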

🧠 Final Thoughts

This serverless setup allows you to build automated data pipelines in AWS with minimal operational overhead. Using S3 + EventBridge + Lambda + Glue, you can:

  • Trigger ETL jobs in real time
  • Eliminate polling
  • Maintain modular and scalable workflows