Deleting S3 source objects after a Glue job run completes is a common way to streamline data management and ensure that only relevant data remains in the bucket. Automating the cleanup frees up storage space and keeps the dataset clean and organized for future analyses.

How to delete source objects after a Glue job run completes

PySpark Code Using Boto3

The script below reads the job arguments, processes the source data, and deletes the source object once the Glue job has finished its work.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3

# Arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'SOURCE_S3_PATH'])

# Initialize Glue context and job
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Your Glue job processing logic here
source_s3_path = args['SOURCE_S3_PATH']
# Example: read from S3
df = spark.read.csv(source_s3_path)
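# Your processing code
# Note: Spark reads are lazy, so finish every write/action that depends on df before the delete below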

# Assuming source_s3_path format is like: s3://bucket-name/path/to/source/file
s3_parts = source_s3_path.replace("s3://", "").split("/", 1)
bucket_name = s3_parts[0]
source_file_key = s3_parts[1]

# Initialize Boto3 S3 client
s3_client = boto3.client('s3')

# Delete the source file after processing
try:
    s3_client.delete_object(Bucket=bucket_name, Key=source_file_key)
    print(f"Deleted source file: {source_s3_path}")
except Exception as e:
    print(f"Error deleting source file: {e}")

job.commit()
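The manual split in the script assumes the path always contains both a bucket and a key; a path such as s3://bucket-name with no key would raise an IndexError. A minimal sketch of a stricter parser follows. The function name parse_s3_uri is hypothetical and is not part of Glue or Boto3.

```python
def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration; raises ValueError when the
    scheme, bucket, or key is missing.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected an s3:// URI, got: {uri!r}")
    # partition() splits on the first "/" only, so the key keeps its own slashes
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"Expected s3://bucket/key, got: {uri!r}")
    return bucket, key


bucket_name, source_file_key = parse_s3_uri("s3://bucket-name/path/to/source/file")
print(bucket_name, source_file_key)
```

Failing early on a malformed path keeps the job from reaching the delete step with an empty or wrong key.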

Difference Between coalesce and repartition

  • coalesce(n): Reduces the number of partitions to n without a full shuffle. It is typically used to decrease the number of partitions to optimize performance for subsequent operations; it cannot increase the partition count.
  • repartition(n): Increases or decreases the number of partitions to n with a full shuffle of data across all nodes. It is used when you must increase the number of partitions or evenly redistribute the data.
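The difference can be observed by comparing partition counts. The sketch below is illustrative only: partition_counts and demo are hypothetical names, and demo assumes a local PySpark installation (inside a Glue job you would pass the job's own DataFrame instead).

```python
def partition_counts(df, n):
    """Return partition counts before and after coalesce(n) and repartition(n)."""
    return {
        "original": df.rdd.getNumPartitions(),
        "coalesce": df.coalesce(n).rdd.getNumPartitions(),
        "repartition": df.repartition(n).rdd.getNumPartitions(),
    }


def demo():
    # Imported here so the helper above can be used without a local Spark install
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[4]")
        .appName("partition-demo")
        .getOrCreate()
    )
    df = spark.range(0, 1000, 1, 8)  # start with 8 partitions
    print(partition_counts(df, 2))   # both calls can shrink the partition count
    print(partition_counts(df, 16))  # only repartition can grow it
    spark.stop()
```

Calling demo() prints both dictionaries; the interesting case is n larger than the current partition count, where coalesce leaves the count unchanged.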

Explanation of Key Components

  • Boto3 S3 Client: Used to delete the file from the S3 bucket after the Glue job is completed.
  • delete_object Method: Deletes the specified object from S3.
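  • job.commit(): Signals the end of the job run (and saves job bookmark state when bookmarks are enabled). Here it is called after the delete, so the job is marked complete only once the source file has been handled.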
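Because delete_object removes a single key, it does not help when SOURCE_S3_PATH points at a prefix (a "folder") containing many files. One possible approach, sketched below, lists the keys under the prefix and removes them in batches with delete_objects. The function name delete_prefix is hypothetical; the client is passed in as a parameter, and the job's IAM role would additionally need s3:ListBucket for the listing step.

```python
def delete_prefix(s3_client, bucket, prefix):
    """Delete every object under prefix; return the number of keys deleted.

    Hypothetical helper: s3_client is expected to be a Boto3 S3 client,
    e.g. boto3.client("s3").
    """
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            continue
        # A listing page holds at most 1000 keys, which matches the
        # per-request limit of delete_objects.
        response = s3_client.delete_objects(
            Bucket=bucket, Delete={"Objects": keys, "Quiet": True}
        )
        errors = response.get("Errors", [])
        if errors:
            raise RuntimeError(
                f"Failed to delete {len(errors)} object(s), first error: {errors[0]}"
            )
        deleted += len(keys)
    return deleted


# Example call (bucket and prefix are placeholders):
# delete_prefix(boto3.client("s3"), "bucket-name", "path/to/source/")
```

Raising on partial failures, rather than only printing them, makes a failed cleanup visible in the job run status.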

How to Run

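  1. Pass Arguments: Pass SOURCE_S3_PATH as a job parameter to your Glue job. Glue expects the parameter key with a leading double hyphen (--SOURCE_S3_PATH); getResolvedOptions then exposes it as args['SOURCE_S3_PATH'].
  2. Permissions: Ensure the IAM role associated with your Glue job has the necessary S3 permissions: s3:GetObject to read the source and s3:DeleteObject to remove it.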
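If the job is started programmatically rather than from the console, the argument can be supplied through Boto3's start_job_run call. This is a minimal sketch: the function name start_cleanup_job and the job name in the example call are placeholders, and the Glue client is passed in as a parameter.

```python
def start_cleanup_job(glue_client, job_name, source_s3_path):
    """Start a Glue job run and return its run ID.

    Hypothetical helper: glue_client is expected to be a Boto3 Glue client,
    e.g. boto3.client("glue").
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        # Job parameters are keyed with a leading "--"; the script reads
        # this one back as args['SOURCE_S3_PATH'].
        Arguments={"--SOURCE_S3_PATH": source_s3_path},
    )
    return response["JobRunId"]


# Example call (job name and path are placeholders):
# start_cleanup_job(boto3.client("glue"), "my-glue-job",
#                   "s3://bucket-name/path/to/source/file")
```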
