Moving data from Amazon S3 into Amazon Redshift is a common task for building analytics pipelines. In this project, we’ll walk through a step-by-step approach to load CSV data from S3 into a Redshift table efficiently and reliably.

🔹 Project Overview

Goal:

  • Load raw CSV files from S3 into Redshift.
  • Prepare the data for analytics and reporting.

Tech Stack:

  • Amazon S3 → stores raw CSV files.
  • Amazon Redshift → analytics data warehouse.
  • IAM Role → grants Redshift access to S3.
  • SQL COPY command → bulk loads CSV data.

1️⃣ Prepare Your CSV Data in S3

Assume we have sales data in S3:

s3://my-bucket/sales_data/sales_2025_01.csv
s3://my-bucket/sales_data/sales_2025_02.csv

Example CSV content:

order_id,order_date,customer_id,amount
1001,2025-01-01,2001,250.50
1002,2025-01-02,2002,300.00
1003,2025-01-05,2003,150.75

⚠️ Make sure all CSV files under the prefix share the same schema (same columns, order, and types); otherwise the load may fail or columns may end up misaligned.

2️⃣ Create the Redshift Table

Create a table that matches your CSV schema:

CREATE TABLE sales (
    order_id     INT,
    order_date   DATE,
    customer_id  INT,
    amount       DECIMAL(10,2)
)
DISTKEY(customer_id)
SORTKEY(order_date);

  • DISTKEY(customer_id) → co-locates rows with the same customer_id on the same slice, which speeds up joins and aggregations on that column.
  • SORTKEY(order_date) → lets Redshift skip blocks when queries filter on order_date (see the example query below).
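
To make these key choices concrete, here is an illustrative reporting query (the query itself is hypothetical, not part of the original walkthrough) that benefits from both: the order_date filter can skip sorted blocks, and grouping on customer_id is handled locally on each slice.

-- Illustrative query: the order_date predicate benefits from the SORTKEY
-- (block skipping), and GROUP BY customer_id benefits from the DISTKEY
-- (rows for a given customer live on one slice, so aggregation stays local).
SELECT customer_id,
       SUM(amount) AS total_spent
FROM sales
WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;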

3️⃣ Grant Redshift Access to S3

  1. Create an IAM role (e.g., RedshiftS3ReadRole) with Redshift as the trusted service.
  2. Attach the AmazonS3ReadOnlyAccess managed policy, or a tighter custom policy scoped to your bucket.
  3. Attach the IAM role to your Redshift cluster.

This lets Redshift read files from S3 securely, without embedding access keys in your SQL.
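
Once the role is attached, the COPY command in the next step references it by its full ARN. If you also register it as the cluster's default IAM role, Redshift lets you write IAM_ROLE default instead of spelling out the ARN. A minimal sketch (this assumes the default IAM role feature is configured on your cluster):

-- With a default IAM role associated to the cluster, the ARN can be omitted
COPY sales
FROM 's3://my-bucket/sales_data/'
IAM_ROLE default
FORMAT AS CSV
IGNOREHEADER 1;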

4️⃣ Load Data Using COPY

The COPY command bulk-loads data from S3, working in parallel when the prefix contains multiple files:

COPY sales
FROM 's3://my-bucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
DELIMITER ','
IGNOREHEADER 1
QUOTE AS '"'
REGION 'us-east-1';

Notes:

  • IGNOREHEADER 1 → skips the header row in each file.
  • DELIMITER ',' → explicit for clarity; a comma is already the default delimiter for CSV format.
  • QUOTE AS '"' → handles fields wrapped in double quotes (e.g., values containing commas); double quotes are also the CSV default.
  • Pointing FROM at the folder-style prefix loads every file under s3://my-bucket/sales_data/ in parallel.
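
If a COPY fails or rejects rows, Redshift records the details in the stl_load_errors system table. A quick troubleshooting query (a minimal sketch; the table and columns are standard Redshift system objects):

-- Inspect the most recent load errors recorded by COPY
SELECT starttime,
       filename,
       line_number,
       colname,
       err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 5;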

5️⃣ Validate the Load

Check your data in Redshift:

SELECT COUNT(*) FROM sales;
SELECT * FROM sales LIMIT 10;

✅ Optional: For production, also check for duplicates, nulls, or schema mismatches.
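
A minimal sketch of those checks, assuming order_id is meant to be unique and amount should never be NULL (both are assumptions about this dataset, not hard requirements):

-- Duplicate check: order_id is assumed to be the business key
SELECT order_id, COUNT(*) AS cnt
FROM sales
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Null check: amount is assumed to be mandatory
SELECT COUNT(*) AS null_amounts
FROM sales
WHERE amount IS NULL;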

6️⃣ Automate the Pipeline

For recurring CSV uploads, consider:

  • AWS Glue → ETL jobs to transform and load data.
  • AWS Lambda → trigger COPY on S3 file uploads.
  • Airflow or Step Functions → orchestrate automated pipelines.
  • S3 prefix filtering or manifest files → prevent duplicate loads (a manifest-based COPY is sketched below).
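
As one example of the manifest approach, COPY can read an explicit file list instead of a prefix, which makes each load run deterministic and repeatable. A minimal sketch, assuming an upstream job writes a manifest to s3://my-bucket/manifests/sales_2025_01.manifest (the path and the job that generates it are hypothetical):

-- MANIFEST tells COPY that the FROM target is a JSON file listing the S3 objects to load
COPY sales
FROM 's3://my-bucket/manifests/sales_2025_01.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
MANIFEST
FORMAT AS CSV
IGNOREHEADER 1
REGION 'us-east-1';

The manifest itself is a small JSON document listing each file's S3 URL, so a Lambda, Glue, or Airflow task can emit one manifest per batch and Redshift loads exactly that batch.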

🔹 Key Takeaways

  • The COPY command is the most efficient way to bulk-load large CSV files into Redshift.
  • Use DISTKEY and SORTKEY for query optimization.
  • Ensure consistent schema across CSV files.
  • Automate using Glue, Lambda, or orchestration tools for production pipelines.
  • Organizing S3 keys into prefixes (“partitioning”) only controls which files a COPY picks up; it does not partition Redshift storage, which relies on distribution and sort keys instead.

✅ Conclusion

In this project, we built a full S3-to-Redshift pipeline:

  1. Uploaded CSV files to S3.
  2. Created a Redshift table with appropriate distribution and sort keys.
  3. Configured IAM roles for secure access.
  4. Loaded CSV data efficiently with the COPY command.
  5. Validated the load and reviewed options for automating the workflow.

With validation and automation in place, this workflow forms a solid foundation for scalable, production-grade data engineering pipelines.