Deleting S3 source objects after a Glue job run completes is a common way to streamline data management and keep only relevant data in the bucket. Automating the cleanup frees up storage space and keeps the dataset clean and organized for future analyses.

PySpark Code Using Boto3
The script below shows how to read the job arguments and delete the source object after the Glue job finishes processing.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
# Arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'SOURCE_S3_PATH'])
# Initialize Glue context and job
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Your Glue job processing logic here
source_s3_path = args['SOURCE_S3_PATH']
# Example: read from S3
df = spark.read.csv(source_s3_path)
# Your processing code
# Assuming source_s3_path format is like: s3://bucket-name/path/to/source/file
s3_parts = source_s3_path.replace("s3://", "").split("/", 1)
bucket_name = s3_parts[0]
source_file_key = s3_parts[1]
# Initialize Boto3 S3 client
s3_client = boto3.client('s3')
# Delete the source file only after all Spark actions (e.g., writes) have finished.
# Spark reads lazily, so deleting the file earlier would break the job.
try:
    s3_client.delete_object(Bucket=bucket_name, Key=source_file_key)
    print(f"Deleted source file: {source_s3_path}")
except Exception as e:
    print(f"Error deleting source file: {e}")
job.commit()
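The path parsing in the script assumes a well-formed s3://bucket/key value. A minimal sketch of a stricter parser follows; the helper name split_s3_uri is hypothetical and not part of Boto3 or Glue. It raises an error early instead of failing later with an IndexError when the argument has no key.

```python
from typing import Tuple


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    scheme = "s3://"
    if not s3_uri.startswith(scheme):
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # Everything before the first "/" is the bucket; the rest is the object key.
    bucket, _, key = s3_uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must contain a bucket and a key: {s3_uri!r}")
    return bucket, key


bucket_name, source_file_key = split_s3_uri("s3://bucket-name/path/to/source/file")
print(bucket_name, source_file_key)
```

Validating the argument before any processing starts means a mistyped SOURCE_S3_PATH stops the job at the beginning rather than at the cleanup step.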
Difference Between coalesce and repartition
- coalesce(n): Reduces the number of partitions to n without a full shuffle. It is typically used to decrease the number of partitions to optimize performance for subsequent operations.
- repartition(n): Increases or decreases the number of partitions to n with a full shuffle of data across all nodes. It is used when you must increase the number of partitions or evenly redistribute the data.
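As a rough sketch of how that choice might be applied in a job, the hypothetical helper below (resize_partitions is not a Spark API) picks coalesce when shrinking and repartition when growing. It relies only on the DataFrame methods coalesce, repartition, and rdd.getNumPartitions().

```python
def resize_partitions(df, target_partitions):
    """Return df with target_partitions partitions, shuffling only when needed.

    Hypothetical helper: df is expected to be a PySpark DataFrame.
    """
    if target_partitions < 1:
        raise ValueError("target_partitions must be at least 1")
    current = df.rdd.getNumPartitions()
    if target_partitions < current:
        # Shrinking: coalesce merges existing partitions without a full shuffle.
        return df.coalesce(target_partitions)
    if target_partitions > current:
        # Growing: coalesce cannot add partitions, so use repartition (full shuffle).
        return df.repartition(target_partitions)
    # Already at the requested count: leave the DataFrame untouched.
    return df
```

Because coalesce only merges partitions, the result can be unevenly sized; if evenly distributed data matters more than avoiding the shuffle, calling df.repartition(n) directly is the safer choice.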
Explanation of Key Components
- Boto3 S3 Client: Used to delete the file from the S3 bucket after the Glue job is completed.
- delete_object Method: Deletes the specified object from S3.
- job.commit(): Marks the Glue job as complete after the source file has been deleted.
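delete_object removes a single key. If SOURCE_S3_PATH points to a folder-style prefix containing several files, one possible approach is to list the keys and remove them in batches. The sketch below assumes a Boto3 S3 client is passed in; the function name delete_prefix is hypothetical, while get_paginator("list_objects_v2") and delete_objects are standard Boto3 S3 client calls.

```python
def delete_prefix(s3_client, bucket, prefix):
    """Delete every object under prefix and return the number of keys deleted.

    Hypothetical helper: s3_client is expected to be boto3.client("s3").
    """
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching keys.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            continue
        # A page holds at most 1,000 keys, which is also the delete_objects limit.
        response = s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": keys, "Quiet": True},
        )
        errors = response.get("Errors", [])
        if errors:
            raise RuntimeError(
                f"Failed to delete {len(errors)} object(s); first error: {errors[0]}"
            )
        deleted += len(keys)
    return deleted
```

In the job it could be called as delete_prefix(boto3.client('s3'), bucket_name, 'path/to/source/'). Listing keys requires the s3:ListBucket permission in addition to s3:DeleteObject.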
How to Run
- Pass Arguments: Make sure to pass SOURCE_S3_PATH as a parameter to your Glue job.
- Permissions: Ensure the IAM role associated with your Glue job has the necessary S3 permissions (s3:DeleteObject).
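For the permissions step, a policy statement along the following lines may be enough for the read-then-delete flow. This is a sketch under assumptions: the bucket name and prefix are placeholders, the helper build_source_cleanup_policy is hypothetical, and s3:GetObject is included on the assumption that the same role also reads the source file.

```python
import json


def build_source_cleanup_policy(bucket, prefix):
    """Build an IAM policy document scoped to objects under one prefix.

    Hypothetical helper; bucket and prefix are placeholder values.
    """
    prefix = prefix.strip("/")
    # Object-level actions apply to object ARNs: arn:aws:s3:::bucket/key-pattern
    if prefix:
        resource = f"arn:aws:s3:::{bucket}/{prefix}/*"
    else:
        resource = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:DeleteObject"],
                "Resource": resource,
            }
        ],
    }


print(json.dumps(build_source_cleanup_policy("bucket-name", "path/to/source"), indent=2))
```

Scoping the resource to the source prefix rather than the whole bucket limits what the job can delete if SOURCE_S3_PATH is ever set incorrectly.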