When you write a few lines of PySpark code, Spark executes a complex distributed workflow behind the scenes. Many data engineers know how to write PySpark, but fewer truly understand how statements become stages, stages become tasks, and tasks run on partitions.
This blog demystifies the internal execution model of Spark by connecting these five concepts:
Statement → Job → Stage → Task → Partition
Once you understand this flow, you can:
- Debug slow jobs faster
- Optimize performance confidently
- Explain Spark execution in interviews
- Design better data pipelines on Databricks or AWS EMR
1. What Is a PySpark Statement?
A PySpark statement is a single line or logical block of PySpark code that performs an operation on a DataFrame or RDD.
Example:
```python
df = spark.read.parquet("s3://bronze/orders/")
df_filtered = df.filter("status = 'COMPLETE'")
df_grouped = df_filtered.groupBy("country").count()
df_grouped.write.mode("overwrite").parquet("s3://gold/orders_by_country/")
```
Each line looks simple, but Spark classifies these operations into two types:
🔹 Transformations
These create a new DataFrame without executing immediately.
Examples:
- select()
- filter()
- withColumn()
- join()
- groupBy()
🔹 Actions
These trigger execution.
Examples:
- write()
- count()
- show()
- collect()
👉 Spark does nothing until it hits an action. This is called lazy evaluation.
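To make lazy evaluation concrete, here is a minimal sketch that reuses the example above (the S3 path and app name are placeholders): the three transformations only build a plan, and nothing runs until show() is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.read.parquet("s3://bronze/orders/")        # builds a plan; the data is not processed yet
df_filtered = df.filter("status = 'COMPLETE'")        # transformation: still only a plan
df_grouped = df_filtered.groupBy("country").count()   # transformation: still only a plan

df_grouped.show()  # action: Spark now optimizes the plan and runs a job
```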
2. From Statement to Spark Job
A Spark job is created only when an action is executed.
In our example:
df_grouped.write.parquet(...)
This line triggers one Spark job.
If you later run:
df_grouped.count()
Spark creates another job.
Rule:
Each action = one Spark job
You can see jobs in:
- Spark UI → Jobs tab
- Databricks job run UI
- EMR Spark History Server
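To see the one-action-one-job rule in practice, a small sketch continuing the example above (the output path is a placeholder): each of the two actions appears as its own entry in the Jobs tab.

```python
# Two actions on the same DataFrame -> two separate Spark jobs.
df_grouped.write.mode("overwrite").parquet("s3://gold/orders_by_country/")  # action: job 1
print(df_grouped.count())                                                   # action: job 2
```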
3. How Spark Breaks a Job into Stages
Once a job is triggered, Spark analyzes the entire logical plan and splits it into stages.
What is a Stage?
A stage is a set of tasks that can be executed in parallel without requiring data to move between executors.
Spark creates a new stage whenever a shuffle is required.
4. The Shuffle: The Reason Stages Exist
Some operations require data to be redistributed across executors:
Operations that cause shuffle:
- groupBy()
- join() (non-broadcast)
- distinct()
- orderBy()
- repartition()
Example:
df.groupBy("country").count()
Spark must:
- Send all rows with country=US to one executor
- Send all rows with country=IN to another
- And so on for every country
This network movement is expensive and creates a stage boundary.
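One way to see the shuffle before running anything is to print the physical plan: shuffles appear as Exchange nodes, and each Exchange marks a stage boundary. A small sketch using the running example:

```python
# The shuffle introduced by groupBy shows up as an Exchange node in the physical plan.
df.groupBy("country").count().explain()
# Expect a line like "Exchange hashpartitioning(country, ...)" in the printed plan --
# that Exchange is the shuffle and therefore the stage boundary.
```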
5. Example: Stages in a Simple PySpark Job
```python
df = spark.read.parquet("s3://bronze/orders/")
df_filtered = df.filter("status = 'COMPLETE'")
df_grouped = df_filtered.groupBy("country").count()
df_grouped.write.parquet("s3://gold/orders_by_country/")
```
Stage breakdown:
| Stage | What Happens |
|---|---|
| Stage 0 | Read parquet, filter rows |
| Stage 1 | Shuffle data by country, aggregate |
| Stage 2 | Write output to S3 |
👉 Even though you wrote 4 statements, Spark created 3 stages.
6. What Is a Task?
A task is the smallest unit of work Spark executes.
Each stage consists of many tasks.
Each task processes one partition.
Rule:
Number of tasks = number of partitions
If your DataFrame has 200 partitions, that stage runs 200 tasks in parallel (subject to available cores).
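You can check the partition count, and therefore the task count, directly. A quick sketch, assuming the DataFrames from the earlier example:

```python
# Tasks in the scan/filter stage = input partitions of df_filtered.
print(df_filtered.rdd.getNumPartitions())

# After a shuffle, the partition count comes from spark.sql.shuffle.partitions
# (default 200, unless Adaptive Query Execution coalesces the shuffle partitions).
print(spark.conf.get("spark.sql.shuffle.partitions"))
```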
7. Partitions: The Heart of Parallelism
A partition is a chunk of data processed by one task.
When Spark reads data:
spark.read.parquet("s3://bronze/orders/")
Partitions are determined by:
- File count
- File size
- spark.sql.files.maxPartitionBytes
- Existing partitioning in S3
Example:
- 100 files → ~100 partitions
- A single 1 TB file (in a splittable format like Parquet) → split into many partitions
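If you need to influence how Spark sizes input partitions at read time, the main knob is spark.sql.files.maxPartitionBytes (default 128 MB). A sketch with an illustrative value and placeholder path:

```python
# Larger value -> fewer, bigger input partitions; smaller value -> more, smaller ones.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))  # ~256 MB

df = spark.read.parquet("s3://bronze/orders/")
print(df.rdd.getNumPartitions())  # roughly total input size / maxPartitionBytes, bounded by file layout
```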
8. Partition → Task → Core Mapping
The execution flow is:
Partition → Task → Executor Core
If you have:
- 8 executors
- 4 cores per executor
- Total 32 cores
Spark can run:
- 32 tasks at the same time
If your stage has:
- 320 tasks → 10 waves of execution
This is why partition count directly affects performance.
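The wave arithmetic is simple enough to sanity-check in plain Python. A sketch using the numbers above:

```python
import math

executors = 8
cores_per_executor = 4
total_cores = executors * cores_per_executor     # 32 task slots running concurrently

tasks_in_stage = 320
waves = math.ceil(tasks_in_stage / total_cores)  # 320 / 32 = 10 waves
print(total_cores, waves)                        # 32 10
```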
9. How Tasks Run Inside a Stage
Inside a stage:
- All tasks run the same logic
- Each task processes different partition data
- Tasks are independent
- No communication between tasks
If one task is slow (data skew), the whole stage waits.
This is why skewed partitions kill performance.
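A quick way to check for skew before it bites: count rows per key on the column you group or join on (here country, as in the running example) and see whether one key dwarfs the rest.

```python
# If one key holds most of the rows, the task that receives it will straggle
# and the whole stage will wait on it.
(df.groupBy("country")
   .count()
   .orderBy("count", ascending=False)
   .show(10))
```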
10. Visualizing the Execution
Think of it like this:
```
Job
├── Stage 0 (100 tasks)
│   ├── Task 1 (Partition 1)
│   ├── Task 2 (Partition 2)
│   └── ...
├── Stage 1 (200 tasks - shuffle)
│   ├── Task 1
│   └── ...
└── Stage 2 (100 tasks - write)
```
11. Repartition vs Coalesce: Stage Impact
repartition()
- Causes shuffle
- Creates new stage
- Even distribution
coalesce()
- No shuffle
- No new stage
- Merges partitions only
Example:
```python
df.repartition(200)  # new stage
df.coalesce(50)      # same stage
```
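You can verify the effect with getNumPartitions(); a sketch continuing the numbers above:

```python
df_200 = df.repartition(200)           # full shuffle -> evenly sized partitions, new stage
print(df_200.rdd.getNumPartitions())   # 200

df_50 = df_200.coalesce(50)            # merges existing partitions, no shuffle, same stage
print(df_50.rdd.getNumPartitions())    # 50
```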
12. Real-World Example (AWS / Databricks)
Suppose:
- Bronze S3 data: 500 partitions
- Silver layer: heavy joins
- Gold layer: aggregations
Typical stage flow:
- Read Bronze (Stage 0)
- Filter & enrich (Stage 1)
- Join dimension tables (Stage 2 – shuffle)
- Aggregate metrics (Stage 3 – shuffle)
- Write Gold (Stage 4)
👉 Understanding this helps you:
- Tune partition size
- Avoid unnecessary shuffles
- Optimize cluster cost
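Putting the pieces together, here is a hypothetical Bronze → Silver → Gold sketch (all paths, table names, and columns are made up for illustration): the broadcast join avoids one shuffle, while the aggregation still needs one.

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

orders = spark.read.parquet("s3://bronze/orders/")            # scan + narrow filter: one stage
orders = orders.filter("status = 'COMPLETE'")

countries = spark.read.parquet("s3://silver/dim_country/")    # small dimension table
enriched = orders.join(broadcast(countries), "country_code")  # broadcast join: no shuffle, no extra stage

metrics = (enriched.groupBy("region")
                   .agg(F.sum("amount").alias("revenue")))    # shuffle -> new stage

metrics.write.mode("overwrite").parquet("s3://gold/revenue_by_region/")  # action: triggers the job
```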
13. How to See Stages and Tasks
Use Spark UI:
- Jobs tab → number of jobs
- Stages tab → shuffle boundaries
- Stage detail page → per-task skew, duration, failures
- SQL tab → logical vs physical plan
Or use:
df.explain(True)
14. Key Performance Rules to Remember
| Concept | Rule |
|---|---|
| Statement | Defines logic, not execution |
| Action | Triggers job |
| Shuffle | Creates new stage |
| Stage | Runs tasks in parallel |
| Task | Processes one partition |
| Partition | Controls parallelism |
| Too many partitions | Overhead |
| Too few partitions | Underutilization |
15. Final Summary
When you write PySpark code, Spark transforms it into:
- Statements → logical plan
- Action → job
- Shuffle boundaries → stages
- Partitions → tasks
- Tasks → executor cores