An AWS Lambda function can trigger an AWS Glue job when a file is uploaded to an S3 bucket. Here are the steps.

Prerequisites for Using AWS Lambda to Trigger a Glue Job

  • AWS Lambda Function: Configured with necessary IAM roles.
  • AWS Glue Job: Created and ready to be triggered.
  • S3 Bucket: Where the files are uploaded.

Steps

  • Configure S3 Event Trigger for Lambda
    • Go to the S3 bucket where the files are uploaded.
    • Create a trigger for the Lambda function by configuring S3 event notifications (e.g., for PUT or POST operations); a scripted alternative is sketched below.
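If you prefer to configure the trigger from a script rather than the console, here is a rough boto3 sketch (the function name, bucket name, and ARNs are placeholders). Note that when you set this up through the API instead of the console, you also have to grant S3 permission to invoke the function:

import boto3

lambda_client = boto3.client('lambda')
s3_client = boto3.client('s3')

# Allow S3 to invoke the Lambda function (the S3 console adds this permission automatically)
lambda_client.add_permission(
    FunctionName='your-function-name',
    StatementId='allow-s3-invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::your-bucket-name'
)

# Send ObjectCreated events (PUT, POST, multipart uploads, etc.) from the bucket to the Lambda function
s3_client.put_bucket_notification_configuration(
    Bucket='your-bucket-name',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:region:account-id:function:your-function-name',
                'Events': ['s3:ObjectCreated:*']
            }
        ]
    }
)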
  • Create Lambda Function to Trigger Glue Job: use the boto3 library to interact with AWS Glue. Here’s a basic example of the code:
import json
import boto3

def lambda_handler(event, context):
    # Log the S3 event
    print("Received event: " + json.dumps(event, indent=2))
    
    # Extract the bucket name and file key from the event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    print(f"Bucket: {bucket_name}, Key: {file_key}")
    
    # Initialize the boto3 client for AWS Glue
    glue_client = boto3.client('glue')

    # Name of your Glue Job
    glue_job_name = "your-glue-job-name"

    try:
        # Start the Glue job
        response = glue_client.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--bucket_name': bucket_name,
                '--file_key': file_key
            }
        )
        print(f"Started Glue job: {response['JobRunId']}")
        return {
            'statusCode': 200,
            'body': json.dumps('Glue job started successfully!')
        }
    except Exception as e:
        print(e)
        return {
            'statusCode': 500,
            'body': json.dumps('Error starting Glue job')
        }
  • IAM Role Configuration for Lambda:
    • s3:GetObject is required to read objects from the S3 bucket.
    • glue:StartJobRun is required to trigger a Glue job.
    • Alternatively, you can attach the AWSGlueServiceRole managed policy to the Lambda execution role for broader Glue permissions, though the narrower glue:StartJobRun policy shown below is sufficient for this use case.
    • Scope the s3:GetObject permission to your specific bucket, as shown in the S3 Access policy below.

S3 Access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        }
    ]
}

Glue Job Access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:StartJobRun",
            "Resource": "arn:aws:glue:region:account-id:job/your-glue-job-name"
        }
    ]
}
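Trust Policy

The two policies above are identity policies attached to the Lambda execution role. The role also needs the standard trust policy that lets the Lambda service assume it (and, typically, the AWSLambdaBasicExecutionRole managed policy so the function can write logs to CloudWatch):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}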
  • Testing and Deploying
    • Upload a file to the S3 bucket.
    • Check Lambda logs in CloudWatch to ensure the job is triggered and runs successfully.
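To exercise the trigger from a script rather than the console, a quick boto3 upload (file, bucket, and key names are placeholders) fires the same event; the function's output, including the JobRunId it prints, then shows up in its CloudWatch log group:

import boto3

s3 = boto3.client('s3')

# Uploading an object to the configured bucket generates the ObjectCreated event
s3.upload_file('myfile.txt', 'your-bucket-name', 'uploaded-file.txt')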

This process starts a Glue job whenever a file is uploaded to the S3 bucket: the Lambda function extracts the bucket and file details from the event and passes them as arguments to the Glue job.

Why do we need to use json.dumps?

The line json.dumps('Error starting Glue job') is part of the Lambda function’s error-handling mechanism. Here’s a breakdown of its purpose:

Explanation

  • json.dumps(): serializes (converts) a Python object (such as a string, dictionary, or list) into a JSON-formatted string. JSON (JavaScript Object Notation) is a lightweight data-interchange format commonly used to exchange structured data between systems.
  • 'Error starting Glue job': the message returned to indicate that an error occurred while attempting to start the AWS Glue job.

Why json.dumps() here?

The body of the return value from the lambda_handler function is expected to be a JSON-formatted string. By using json.dumps(), the error message is serialized into JSON before being sent as part of the Lambda response.

Full Context in the Code

  • statusCode: 500: Indicates an internal server error in the HTTP response (500 is a general error status).
  • body: Contains the message, which in this case, is 'Error starting Glue job' converted to a JSON string via json.dumps().

This helps ensure that the response from the Lambda function is in a consistent JSON format. This consistency is useful when integrating Lambda with other services like API Gateway or Step Functions.
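As a small illustration (the values are arbitrary examples), json.dumps() turns Python values into JSON text:

import json

# A plain Python string becomes a JSON string (note the added double quotes)
print(json.dumps('Error starting Glue job'))
# Output: "Error starting Glue job"

# A dictionary becomes a JSON object
print(json.dumps({'statusCode': 500, 'body': 'Error starting Glue job'}))
# Output: {"statusCode": 500, "body": "Error starting Glue job"}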

How an S3 event appears

When a file is uploaded to an S3 bucket, S3 generates an event notification that invokes the Lambda function. The structure below is based on the S3 event notification that AWS generates when an object is added to a bucket:

Example S3 Event
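
A representative event looks like this (abridged; real notifications include additional metadata fields):

{
    "Records": [
        {
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": "us-east-1",
            "eventTime": "2024-10-15T15:16:47.000Z",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "s3SchemaVersion": "1.0",
                "bucket": {
                    "name": "your-bucket-name",
                    "arn": "arn:aws:s3:::your-bucket-name"
                },
                "object": {
                    "key": "uploaded-file.txt",
                    "size": 1024,
                    "eTag": "0123456789abcdef0123456789abcdef"
                }
            }
        }
    ]
}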

Key Fields in the Event:

  1. Records: An array holding one or more S3 event records; each record corresponds to an S3 object action (in this case, an object upload).
  2. eventSource: The AWS service that triggered the event; here it is "aws:s3".
  3. eventName: The specific event type. "ObjectCreated:Put" means an object was uploaded to S3.
  4. bucket: Contains information about the S3 bucket where the event occurred:
    • name: The name of the S3 bucket.
    • arn: The Amazon Resource Name (ARN) of the bucket.
  5. object: Contains information about the object (file) that was uploaded:
    • key: The key (filename) of the object within the bucket. In this example, it’s "uploaded-file.txt".
    • size: The size of the object in bytes.
    • eTag: An entity tag for the uploaded object (typically an MD5 hash of its contents).

Usage in Lambda

In your Lambda function, extract relevant information from this event to pass to the AWS Glue job. For example:

  • bucket name: event['Records'][0]['s3']['bucket']['name']
  • file key: event['Records'][0]['s3']['object']['key']

You can then pass these fields as arguments to the AWS Glue job. This is demonstrated in the previous Lambda function example.

Sample Lambda code: handling a file uploaded to a bucket with subfolders

To handle a scenario where the file is uploaded to an S3 bucket with two subfolders (for example: s3://your-bucket/subfolder1/subfolder2/myfile.txt), you can modify your AWS Lambda function to extract the full S3 key (including subfolders) and trigger an AWS Glue job.

Here’s a sample Lambda code that triggers the AWS Glue job when a file is uploaded:

Lambda Function Code

import json
import boto3

def lambda_handler(event, context):
    # Log the received event for debugging
    print("Received event: " + json.dumps(event, indent=2))

    # Extract the bucket name and object key (file path) from the event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']

    # Log the bucket and file path for debugging
    print(f"Bucket: {bucket_name}, File Key: {file_key}")

    # Initialize the boto3 client for AWS Glue
    glue_client = boto3.client('glue')

    # The name of your Glue job
    glue_job_name = "your-glue-job-name"

    try:
        # Start the Glue job, passing the bucket and file information as job arguments
        response = glue_client.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--bucket_name': bucket_name,
                '--file_key': file_key
            }
        )
        print(f"Started Glue job with JobRunId: {response['JobRunId']}")
        return {
            'statusCode': 200,
            'body': json.dumps('Glue job started successfully!')
        }
    except Exception as e:
        # Log any error that occurs
        print(f"Error starting Glue job: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps('Error starting Glue job')
        }
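One caveat: S3 URL-encodes the object key in event notifications (for example, spaces become '+' and special characters are percent-encoded). If your file names can contain such characters, decode the key inside lambda_handler before passing it to Glue:

from urllib.parse import unquote_plus

# Decode the URL-encoded key from the S3 event before using it
file_key = unquote_plus(event['Records'][0]['s3']['object']['key'])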

Key Points:

  1. Extracting the S3 Bucket and Key:
    • The bucket_name is extracted from the S3 event using event['Records'][0]['s3']['bucket']['name'].
    • The file_key is the full path of the file including subfolders (subfolder1/subfolder2/myfile.txt), and it’s extracted using event['Records'][0]['s3']['object']['key'].
  2. Triggering the Glue Job:
    • The start_job_run method triggers the Glue job, passing the S3 bucket and file key as arguments (--bucket_name and --file_key).
    • These arguments can then be used inside the Glue job’s script to reference the file for processing.
  3. Error Handling:
    • If starting the Glue job fails, the function catches the exception and returns a 500 response with a message indicating the failure.

Example S3 Event for File in Subfolders:

{
    "Records": [
        {
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": "us-east-1",
            "eventTime": "2024-10-15T15:16:47.000Z",
            "eventName": "ObjectCreated:Put",
            "userIdentity": {
                "principalId": "AWSUSERID"
            },
            "requestParameters": {
                "sourceIPAddress": "123.456.789.000"
            },
            "responseElements": {
                "x-amz-request-id": "EXAMPLE12345ABC",
                "x-amz-id-2": "EXAMPLEID/XYZ12345="
            },
            "s3": {
                "s3SchemaVersion": "1.0",
                "configurationId": "example-config-id",
                "bucket": {
                    "name": "your-bucket",
                    "ownerIdentity": {
                        "principalId": "AWSOWNERID"
                    },
                    "arn": "arn:aws:s3:::your-bucket"
                },
                "object": {
                    "key": "subfolder1/subfolder2/myfile.txt",
                    "size": 2048,
                    "eTag": "0123456789abcdef0123456789abcdef",
                    "sequencer": "0123456789ABCDEF"
                }
            }
        }
    ]
}

How the Glue Job Will Use Arguments

Within the AWS Glue job script, you can access the passed arguments like this:

import sys
from awsglue.utils import getResolvedOptions

# Get the passed bucket name and file key
args = getResolvedOptions(sys.argv, ['bucket_name', 'file_key'])

bucket_name = args['bucket_name']
file_key = args['file_key']

# You can now use bucket_name and file_key in your Glue job logic
print(f"Processing file: s3://{bucket_name}/{file_key}")

This setup triggers the AWS Glue job each time a file is uploaded to the S3 bucket, including files in nested folders, and passes the full path of the file as an argument to the Glue job for further processing.