When building data pipelines on AWS, you’ll often combine AWS Glue with services like S3, Kinesis, Kafka, SNS, EventBridge, Lambda, Redshift, and Athena. Each service plays a key role in data ingestion, transformation, and analytics. In this guide, we’ll cover the most crucial aspects of these services — including parallelism, scalability, performance optimization, failure handling, and access control. Whether you are preparing for an interview or working on real-world projects, this blog will help you quickly refresh the fundamentals of AWS Glue and its related ecosystem.
I cover:
- Parallelism
- Failures & Failure handling
- Performance
- Scalability
- Access & Permissions
This way, you get a complete refresh in ~1 hour.
🔹 AWS Glue
Parallelism
- Achieved via Spark executors/partitions, workers (Standard/G.1X/G.2X), and JDBC partitioned reads.
- DynamicFrames allow parallel transformations.
Failures & Handling
- Job retries configurable.
- Streaming jobs use checkpointing.
- Errors logged to CloudWatch.
- Schema evolution handled via Glue Data Catalog updates.
Performance
- Predicate pushdown, partition pruning.
- Columnar formats (Parquet/ORC).
- Optimize spark.sql.shuffle.partitions.
- Use DataFrames instead of DynamicFrames where possible.
Scalability
- Auto-scaling workers.
- Can run workflows with multiple dependent jobs.
- Handles TB–PB scale data when partitioned well.
Access & Permissions
- IAM role required with S3/DB/Redshift/Kinesis permissions.
- Lake Formation can provide fine-grained table/column security.
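The JDBC partitioned reads mentioned above work by splitting a numeric column's range into per-partition WHERE clauses so each Spark task reads its own slice. The helper below is a hypothetical illustration of that splitting logic (mirroring how Spark's lowerBound/upperBound/numPartitions options behave); the column and bounds are made-up examples.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) on a numeric column into one WHERE-clause
    predicate per parallel JDBC read, Spark-style."""
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # first partition also sweeps up values below the lower bound
            predicates.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition sweeps up values at/above its edge
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return predicates

preds = jdbc_partition_predicates("order_id", 0, 1000, 4)
```

Each predicate becomes one parallel read task, so a well-chosen partition column (evenly distributed, indexed) directly controls Glue's JDBC parallelism.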
🔹 Amazon S3
Parallelism
- Parallel reads/writes by Glue, Athena, Redshift COPY, or Lambda using multiple threads.
Failures & Handling
- Failed uploads are retried automatically by the AWS SDKs (transient errors).
- Versioning to recover deleted/overwritten files.
Performance
- Use Parquet/ORC instead of CSV/JSON.
- Partition folders by date/region.
- Avoid many small files (optimize file size ~128 MB).
Scalability
- Virtually unlimited objects/storage.
- Scales automatically with request rates.
Access & Permissions
- Controlled via IAM roles, bucket policies, ACLs.
- Can use KMS encryption for compliance.
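The "avoid many small files" advice above is easy to quantify: given a partition's total size, compute how many output files to coalesce down to so each lands near the ~128 MB sweet spot. A minimal sketch (the function name and target are illustrative, not an AWS API):

```python
import math

def plan_compaction(total_bytes, target_file_mb=128):
    """Return how many output files to write so each is ~target_file_mb."""
    target = target_file_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target))

# e.g. a 10 GB partition of small JSON files compacts to 80 Parquet files
n_files = plan_compaction(10 * 1024**3)
```

In Spark/Glue you would then call coalesce(n_files) or repartition(n_files) before writing the partition back as Parquet.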
🔹 Amazon Kinesis
Parallelism
- Parallelism via shards (each shard = 1 MB/sec write, 2 MB/sec read).
- Glue or Lambda can process shards in parallel.
Failures & Handling
- Data retained 24 hours by default, extendable up to 365 days (retry window).
- Dead-letter queues (DLQ) with Lambda for failed records.
- Consumer retries on checkpoint failures.
Performance
- Increase shard count for throughput.
- Use enhanced fan-out for high-performance consumers.
Scalability
- Scale shards up/down elastically.
- Managed by AWS, scales with demand.
Access & Permissions
- IAM roles/policies for producers/consumers.
- VPC endpoints for private access.
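Since each shard caps out at the limits cited above (1 MB/s or 1,000 records/s ingest, 2 MB/s egress), stream sizing is just a max over three ratios. A small back-of-the-envelope helper (not an AWS API, purely illustrative):

```python
import math

def required_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Size a Kinesis stream from per-shard limits:
    1 MB/s or 1,000 records/s write, 2 MB/s read per shard."""
    by_write = math.ceil(write_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    by_read = math.ceil(read_mb_per_sec / 2.0)
    return max(by_write, by_records, by_read, 1)

# 5 MB/s in, 3,500 records/s, 8 MB/s out across consumers -> 5 shards
shards = required_shards(5, 3500, 8)
```

Note the read side assumes shared throughput; with enhanced fan-out each registered consumer gets its own 2 MB/s per shard, so the read term often drops out.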
🔹 Apache Kafka (MSK)
Parallelism
- Parallelism via partitions per topic.
- Consumers can read partitions in parallel.
Failures & Handling
- Offsets stored in Kafka or external DBs.
- Consumer retries configurable.
- Replication across brokers for fault tolerance.
Performance
- Tune partition count, batch size, compression.
- Use multiple brokers for throughput.
Scalability
- Add more brokers & partitions.
- Horizontally scalable cluster.
Access & Permissions
- IAM (for MSK) or SASL/SSL for client authentication.
- Fine-grained access via Kafka ACLs.
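"Consumers can read partitions in parallel" works because each partition is owned by exactly one consumer in a group. The sketch below imitates Kafka's range assignor, the default strategy that hands consecutive partition chunks to each consumer (a simplified illustration, not the client library's actual code):

```python
def range_assign(partitions, consumers):
    """Simplified Kafka range assignor: consecutive partition chunks per
    consumer; the first consumers absorb the remainder."""
    consumers = sorted(consumers)
    n, k = len(partitions), len(consumers)
    per, extra = divmod(n, k)
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment

plan = range_assign(list(range(7)), ["c1", "c2", "c3"])
```

This also shows why adding consumers beyond the partition count buys nothing: the extras get empty assignments, so partition count is the real parallelism ceiling.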
🔹 Amazon SNS
Parallelism
- Fan-out to multiple subscribers in parallel (Lambda, SQS, HTTP).
Failures & Handling
- Retry policy for undelivered messages.
- DLQ support with SQS.
Performance
- High throughput pub/sub.
- Message filtering for efficient delivery.
Scalability
- Fully managed; scales automatically to very high message volumes.
Access & Permissions
- Controlled with IAM policies and topic policies.
- Encryption with KMS supported.
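Message filtering means SNS only delivers to a subscriber when the message attributes satisfy that subscription's filter policy. A simplified matcher showing the core idea (exact-match string policies only; real SNS also supports anything-but, prefix, and numeric operators):

```python
def matches_filter(policy, attributes):
    """Simplified SNS filter-policy check: every policy key must be present
    in the message attributes with one of the allowed values."""
    return all(attributes.get(key) in allowed for key, allowed in policy.items())

policy = {"event_type": ["order_created", "order_updated"]}
hit = matches_filter(policy, {"event_type": "order_created"})
```

Filtering at the topic keeps irrelevant messages from ever invoking a subscriber, which is cheaper than filtering inside each Lambda.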
🔹 Amazon EventBridge
Parallelism
- Routes events to multiple targets in parallel.
Failures & Handling
- Retries failed targets.
- DLQ via SQS.
- Archive/replay for event recovery.
Performance
- Near real-time delivery.
- Rule filtering to reduce unnecessary processing.
Scalability
- Scales automatically with event volume.
Access & Permissions
- IAM roles required for publishing/consuming events.
- Resource-based policies on event buses.
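Rule filtering in EventBridge works by matching each event against a JSON pattern: a field matches when its value appears in the pattern's list, and nested objects recurse. A simplified matcher conveying the semantics (it omits real operators like prefix, numeric, and exists):

```python
def event_matches(pattern, event):
    """Simplified EventBridge rule matching: each pattern field lists
    accepted values; nested dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not event_matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

pattern = {"source": ["aws.s3"], "detail": {"eventName": ["PutObject"]}}
```

A rule whose pattern matches routes the event to its targets in parallel; events matching no rule are simply dropped from that bus.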
🔹 AWS Lambda
Parallelism
- Event-driven: concurrent events fan out across separate function instances.
- Account concurrency limit (default 1,000, can increase).
Failures & Handling
- Automatic retries (for async invokes).
- DLQ/SNS/SQS for failed events.
Performance
- Cold start overhead (mitigated with provisioned concurrency).
- Tune memory → improves CPU/network.
Scalability
- Scales automatically with request load.
- Burst concurrency supported.
Access & Permissions
- IAM execution role defines access.
- Can run inside VPC for private resources.
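The retry-then-DLQ behavior for async invokes can be sketched as a plain control-flow pattern: attempt the handler, retry a bounded number of times, and route the event to a dead-letter destination instead of losing it. This is an illustrative imitation of the platform behavior, not Lambda's internal code:

```python
def process_with_retries(handler, event, max_retries=2, dead_letter=None):
    """Invoke handler(event); retry up to max_retries on failure, then
    hand the event to a dead-letter callback rather than dropping it."""
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return handler(event)
        except Exception as exc:
            last_error = exc
    if dead_letter:
        dead_letter(event, last_error)
    return None
```

The default of two retries matches Lambda's async invoke behavior; in production the dead_letter callback would publish to an SQS queue or SNS topic configured as the failure destination.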
AWS Lambda itself does not manage schema evolution.
Lambda is just a compute service — it processes the payload (JSON, Avro, Parquet, etc.) that you send in. Schema evolution comes into play when Lambda is integrated with data services like Glue, Kafka, Kinesis, or EventBridge.
🔹 How Schema Evolution Relates to Lambda
- With AWS Glue Data Catalog
  - If your Lambda function reads data from an S3 bucket registered in Glue, the schema is stored in the Glue Data Catalog.
  - Schema evolution (new columns, type changes) is managed in Glue, not in Lambda.
  - Your Lambda code must be written defensively (e.g., handle missing/new fields in JSON).
- With AWS Glue Schema Registry (Kafka/Kinesis)
  - If Lambda consumes from Kafka (MSK) or Kinesis with a schema registered in Glue Schema Registry, it validates the message schema before processing.
  - Schema evolution (like backward/forward compatibility rules) is enforced by the Schema Registry, not Lambda.
  - If the schema breaks compatibility, Lambda may fail unless you add fallback handling.
- With EventBridge
  - EventBridge allows schema discovery and stores event structure in the Schema Registry.
  - Lambda subscribers can use this schema (code bindings can be auto-generated).
  - If the schema evolves, Lambda still gets the event, but your function must be updated to handle new/changed fields.
🔹 Best Practices for Handling Schema Evolution in Lambda
- Use default values for missing fields when parsing payloads.
- Wrap parsing logic in try/except to handle unknown/new fields gracefully.
- Version control your schema in Glue Schema Registry if using Kafka/Kinesis.
- Test backward compatibility before deploying Lambda updates.
- Log unexpected fields so you can adapt your function when schema changes.
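Putting those practices together in one defensive parser sketch (the field names and defaults are hypothetical examples):

```python
import json

KNOWN_FIELDS = {"order_id", "amount", "currency"}

def parse_order(payload: str) -> dict:
    """Defensive parse: defaults for missing fields, tolerate malformed
    input, and surface unknown fields for later schema review."""
    try:
        raw = json.loads(payload)
    except json.JSONDecodeError:
        # fall back to an all-defaults record rather than crashing
        raw = {}
    return {
        "order_id": raw.get("order_id"),           # None if absent
        "amount": float(raw.get("amount", 0.0)),   # default value
        "currency": raw.get("currency", "USD"),    # default value
        # keep new/unknown fields instead of silently dropping them
        "_unknown": {k: v for k, v in raw.items() if k not in KNOWN_FIELDS},
    }
```

In a real function you would log the `_unknown` map so a new upstream field shows up in CloudWatch before it ever breaks anything.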
✅ Summary:
Schema evolution is not native to Lambda — it’s handled by Glue Schema Registry (for Kafka/Kinesis), Glue Data Catalog (for S3/ETL), or EventBridge Schema Registry. Lambda just needs to be coded flexibly to handle evolving schemas without breaking.
🔹 Amazon Redshift
Parallelism
- MPP (Massively Parallel Processing) architecture with leader + compute nodes.
- COPY command loads data in parallel from S3.
Failures & Handling
- Cluster snapshots and automated backups.
- Query retries via WLM (Workload Management).
Performance
- Use columnar storage & compression.
- Sort keys & distribution keys tuned.
- Spectrum for external S3 queries.
Scalability
- RA3 nodes → separate compute & storage scaling.
- Elastic Resize for cluster scaling.
Access & Permissions
- IAM roles for S3/Glue access.
- Redshift-specific users/roles for table-level permissions.
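The parallel COPY from S3 is a single SQL statement: Redshift fans the files under the prefix out across compute-node slices on its own. A small builder sketch (table, bucket, and role names are placeholders):

```python
def build_copy(table, s3_prefix, iam_role_arn):
    """Assemble a COPY statement that loads all files under an S3 prefix
    in parallel; Redshift distributes them across slices automatically."""
    return (
        f"COPY {table}\n"
        f"FROM 's3://{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )

sql = build_copy("sales", "my-bucket/sales/2024/",
                 "arn:aws:iam::123456789012:role/RedshiftCopyRole")
```

For best parallelism, split the input into roughly a multiple of the cluster's slice count so every slice has work; one giant file serializes the load.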
🔹 Amazon Athena
Parallelism
- Queries execute in parallel across multiple nodes.
- Parallel reads from S3 partitions.
Failures & Handling
- Queries fail on schema mismatch, bad data, or permission issues.
- No automatic retry; rerun the query after fixing the schema/data.
Performance
- Partition pruning.
- Columnar storage (Parquet/ORC).
- Avoid small files (compaction).
Scalability
- Serverless; scales automatically with query size.
- Handles TB–PB of data depending on partitions.
Access & Permissions
- Uses Glue Data Catalog or Hive Metastore.
- IAM policies for S3 + Athena queries.
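Partition pruning only kicks in when the query filters on the partition columns themselves. Assuming the common year/month/day partition layout, a small query builder illustrates the shape of a pruned query (table and column names are illustrative):

```python
def pruned_query(table, iso_date):
    """Build an Athena query that filters on partition columns
    (year/month/day assumed) so only matching S3 prefixes are scanned."""
    year, month, day = iso_date.split("-")
    return (
        f"SELECT * FROM {table} "
        f"WHERE year = '{year}' AND month = '{month}' AND day = '{day}'"
    )

sql = pruned_query("events", "2024-05-01")
```

Filtering on a derived expression instead (e.g. wrapping the partition column in a function) can defeat pruning and force a full scan, which is what drives Athena cost.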
✅ This gives you a complete 360° view for each service in interview language.