In modern data pipelines, automation is key. A common requirement is to automatically trigger an AWS Glue ETL job whenever new data is uploaded to Amazon S3.

AWS provides an elegant, fully serverless way to achieve this using Amazon EventBridge and AWS Lambda.

In this blog, we’ll walk through a real-time data pipeline that:

  • Detects file uploads to Amazon S3
  • Sends events via EventBridge
  • Triggers a Lambda function
  • Starts an AWS Glue job

Let’s dive into the architecture and implementation 🚀


🏗️ Architecture Overview

S3 → EventBridge → Lambda → Glue Job


🔁 End-to-End Flow

  1. Data Upload to S3
    A new file lands in a specific S3 bucket.
  2. EventBridge Rule
    Captures the Object Created event from S3 and forwards it to the Lambda target.
  3. Lambda Function
    Parses the event and starts the corresponding Glue job.
  4. AWS Glue Job
    Performs the ETL (Extract, Transform, Load) processing.

📦 Prerequisites

Before you begin, ensure you have the following:

  • An AWS S3 bucket
  • An AWS Glue job created with a valid script
  • A Lambda execution role with permission to start Glue jobs
  • An EventBridge rule linked to S3 events
  • A Lambda function using Python

🔧 Step-by-Step Implementation


1️⃣ Create an S3 Bucket and Upload Sample Data

Upload your files to a folder path such as:

s3://your-bucket/input-data/

💡 You don't need to configure per-prefix S3 Event Notifications for this pattern, but you do need to enable EventBridge delivery on the bucket: in the bucket's Properties, under Amazon EventBridge, turn on "Send notifications to Amazon EventBridge for all events in this bucket" (or enable it programmatically, as sketched below).
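
If you prefer to script this step, here's a minimal boto3 sketch of that toggle. The bucket name your-bucket is a placeholder; replace it with your own.

import boto3

s3 = boto3.client('s3')

# Turn on "Send notifications to Amazon EventBridge" for the bucket.
# An empty EventBridgeConfiguration block enables the feature.
# Note: this call replaces any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket='your-bucket',
    NotificationConfiguration={'EventBridgeConfiguration': {}}
)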


2️⃣ Create an EventBridge Rule

Navigate to:

Amazon EventBridge → Rules → Create rule

Rule Details

  • Name: S3ToGlueTriggerRule
  • Event pattern:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["your-bucket"]
    },
    "object": {
      "key": [{
        "prefix": "input-data/"
      }]
    }
  }
}

Target

  • Target type: Lambda function
  • Target: (Select the Lambda function created in the next step)
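
The same wiring can also be done programmatically. Below is a rough boto3 sketch, assuming the Lambda function from the next step is named s3-to-glue-trigger (a placeholder, as are the account and region in the ARN): it creates the rule with the pattern above, adds the Lambda target, and grants EventBridge permission to invoke the function (the console adds that permission for you automatically).

import json
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["your-bucket"]},
        "object": {"key": [{"prefix": "input-data/"}]}
    }
}

# Create (or update) the rule with the event pattern shown above
rule_arn = events.put_rule(
    Name='S3ToGlueTriggerRule',
    EventPattern=json.dumps(event_pattern),
    State='ENABLED'
)['RuleArn']

# Point the rule at the Lambda function (placeholder ARN)
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:s3-to-glue-trigger'
events.put_targets(
    Rule='S3ToGlueTriggerRule',
    Targets=[{'Id': 'TriggerGlueLambda', 'Arn': function_arn}]
)

# Allow EventBridge to invoke the function (the console normally does this for you)
lambda_client.add_permission(
    FunctionName='s3-to-glue-trigger',
    StatementId='AllowEventBridgeInvoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)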

3️⃣ Create the Lambda Function

This Lambda function will listen to S3 events and trigger the Glue job.

🐍 Lambda Python Code

import boto3
import json
import os

glue = boto3.client('glue')

def lambda_handler(event, context):
    print("Received Event:", json.dumps(event))

    # Extract S3 bucket and object key
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    glue_job_name = os.environ['GLUE_JOB_NAME']  # Environment variable

    try:
        response = glue.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--bucket': bucket,
                '--key': key
            }
        )
        print("Glue Job Triggered:", response['JobRunId'])
        return {
            'statusCode': 200,
            'body': f"Triggered Glue Job {glue_job_name} with run ID {response['JobRunId']}"
        }
    except Exception as e:
        print("Error triggering Glue Job:", e)
        return {
            'statusCode': 500,
            'body': f"Error: {str(e)}"
        }


✅ Lambda Configuration

  • Runtime: Python 3.9 or above
  • Environment Variable: GLUE_JOB_NAME = your-glue-job-name
  • Permissions:
    • glue:StartJobRun
    • CloudWatch Logs permissions

⚠️ Avoid using full admin policies in production—use least privilege.
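
Before wiring everything together, you can sanity-check the function with a synthetic "Object Created" event. The sketch below keeps only the fields the handler actually reads; the function name s3-to-glue-trigger and the object key are placeholders.

import json
import boto3

lambda_client = boto3.client('lambda')

# Minimal synthetic EventBridge event with just the fields the handler uses
test_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "your-bucket"},
        "object": {"key": "input-data/sales_data.csv"}
    }
}

response = lambda_client.invoke(
    FunctionName='s3-to-glue-trigger',  # placeholder function name
    Payload=json.dumps(test_event).encode('utf-8')
)
print(response['Payload'].read().decode('utf-8'))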


4️⃣ Create the AWS Glue Job

Below is a basic PySpark Glue script that reads CSV data from S3, performs a simple transformation, and writes the output back to S3.


🔥 AWS Glue PySpark Script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.job import Job

# Get parameters passed from Lambda
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

bucket = args['bucket']
key = args['key']

input_path = f"s3://{bucket}/{key}"
output_path = f"s3://{bucket}/processed-data/"

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data
df = spark.read.option("header", True).csv(input_path)

# Sample transformation: drop null rows
df_clean = df.na.drop()

# Write output back to S3
df_clean.write.mode("overwrite").parquet(output_path)

job.commit()
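
The job itself can also be registered from code instead of the console. Here's a rough boto3 sketch: the job name matches the GLUE_JOB_NAME environment variable used by the Lambda, while the role ARN, script location, and worker settings are placeholders to adapt. Note that the Glue job's role is separate from the Lambda execution role and needs read/write access to the S3 bucket.

import boto3

glue = boto3.client('glue')

# Register the job; the PySpark script above must already be uploaded to S3
glue.create_job(
    Name='your-glue-job-name',
    Role='arn:aws:iam::123456789012:role/GlueETLJobRole',  # placeholder role ARN
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-bucket/scripts/etl_job.py',  # placeholder path
        'PythonVersion': '3'
    },
    GlueVersion='4.0',
    NumberOfWorkers=2,
    WorkerType='G.1X'
)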


🔐 Required IAM Permissions

IAM Policy for Lambda Execution Role

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartJobRun",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
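
To attach these permissions from code, one option is an inline policy on the Lambda execution role. The role name below is a placeholder; the policy document is the same one shown above.

import json
import boto3

iam = boto3.client('iam')

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["glue:StartJobRun", "logs:CreateLogGroup",
                   "logs:CreateLogStream", "logs:PutLogEvents"],
        "Resource": "*"
    }]
}

# Attach the policy inline to the Lambda execution role (placeholder role name)
iam.put_role_policy(
    RoleName='s3-to-glue-trigger-role',
    PolicyName='StartGlueJobPolicy',
    PolicyDocument=json.dumps(policy_document)
)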


🔍 Testing the Workflow

  1. Upload a file (e.g., sales_data.csv) to: s3://your-bucket/input-data/
  2. Verify:
    • Lambda logs in CloudWatch
    • Glue job execution in AWS Glue Console
    • Output files in: s3://your-bucket/processed-data/
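
If you'd rather verify from a script than click through the console, a quick boto3 check like this (job name is a placeholder) lists the most recent runs and their status:

import boto3

glue = boto3.client('glue')

# Show the latest runs for the job triggered by the upload
runs = glue.get_job_runs(JobName='your-glue-job-name', MaxResults=5)
for run in runs['JobRuns']:
    print(run['Id'], run['JobRunState'], run.get('StartedOn'))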

🧠 Final Thoughts

This serverless event-driven architecture enables fully automated ETL pipelines in AWS with minimal operational overhead.

By combining S3 + EventBridge + Lambda + Glue, you can:

  • Trigger ETL jobs in real time
  • Eliminate inefficient polling
  • Build scalable, modular, and cost-effective data pipelines

If you’re designing modern data platforms on AWS, this pattern is a must-have in your toolkit ✅