When you write a few lines of PySpark code, Spark executes a complex distributed workflow behind the scenes. Many data engineers know how to write PySpark, but fewer truly understand how statements become stages, stages become tasks, and tasks run on partitions.

This blog demystifies the internal execution model of Spark by connecting these five concepts:

Statement → Job → Stage → Task → Partition

Once you understand this flow, you can:

  • Debug slow jobs faster
  • Optimize performance confidently
  • Explain Spark execution in interviews
  • Design better data pipelines on Databricks or AWS EMR

1. What Is a PySpark Statement?

A PySpark statement is a single line or logical block of PySpark code that performs an operation on a DataFrame or RDD.

Example:

df = spark.read.parquet("s3://bronze/orders/")
df_filtered = df.filter("status = 'COMPLETE'")
df_grouped = df_filtered.groupBy("country").count()
df_grouped.write.mode("overwrite").parquet("s3://gold/orders_by_country/")

Each line looks simple, but Spark classifies them into two types of operations:

🔹 Transformations

These create a new DataFrame without executing immediately.

Examples:

  • select()
  • filter()
  • withColumn()
  • join()
  • groupBy()

🔹 Actions

These trigger execution.

Examples:

  • write()
  • count()
  • show()
  • collect()

👉 Spark does nothing until it hits an action. This is called lazy evaluation.
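
Here is a minimal sketch of lazy evaluation you can run locally (the SparkSession setup and the generated data are illustrative assumptions, not the S3 pipeline above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)                         # transformation: nothing runs yet
even = df.filter("id % 2 = 0")                      # transformation: still nothing runs
doubled = even.withColumn("twice", even["id"] * 2)  # still lazy

print(doubled.count())                              # action: Spark now plans and runs a job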


2. From Statement to Spark Job

A Spark job is created only when an action is executed.

In our example:

df_grouped.write.parquet(...)

This line triggers one Spark job.

If you later run:

df_grouped.count()

Spark creates another job.

Rule:

Each action = one Spark job

You can see jobs in:

  • Spark UI → Jobs tab
  • Databricks job run UI
  • EMR Spark History Server
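
Here is a small runnable sketch of that rule (the local session and generated data are illustrative assumptions, not the S3 pipeline above). Each action shows up as its own entry in the Jobs tab:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("jobs-demo").getOrCreate()

orders = spark.range(100_000).withColumnRenamed("id", "order_id")
complete = orders.filter("order_id % 3 = 0")   # transformation: no job yet

complete.count()    # action -> triggers a job
complete.show(5)    # another action -> triggers another job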

3. How Spark Breaks a Job into Stages

Once a job is triggered, Spark analyzes the entire logical plan and splits it into stages.

What is a Stage?

A stage is a set of tasks that can be executed in parallel without requiring data to move between executors.

Spark creates a new stage whenever a shuffle is required.


4. The Shuffle: The Reason Stages Exist

Some operations require data to be redistributed across executors:

Operations that cause shuffle:

  • groupBy()
  • join() (non-broadcast)
  • distinct()
  • orderBy()
  • repartition()

Example:

df.groupBy("country").count()

Spark must:

  • Send all rows with country=US to one executor
  • Send all rows with country=IN to another
  • And so on for every country

This network movement is expensive and creates a stage boundary.
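
You can see that boundary in the physical plan: the Exchange operator marks where Spark shuffles and starts a new stage. A minimal sketch, assuming an active SparkSession named spark and a tiny in-memory DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [("US", 10.0), ("IN", 5.0), ("US", 7.5)],
    ["country", "amount"],
)

# The printed plan typically contains an "Exchange hashpartitioning(country, ...)"
# node, which is exactly the shuffle / stage boundary described above
orders.groupBy("country").count().explain()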


5. Example: Stages in a Simple PySpark Job

df = spark.read.parquet("s3://bronze/orders/")
df_filtered = df.filter("status = 'COMPLETE'")
df_grouped = df_filtered.groupBy("country").count()
df_grouped.write.parquet("s3://gold/orders_by_country/")

Stage breakdown:

Stage   | What Happens
Stage 0 | Read Parquet and filter rows
Stage 1 | Shuffle data by country and aggregate
Stage 2 | Write output to S3

👉 Even though you wrote 4 statements, Spark created 3 stages.


6. What Is a Task?

A task is the smallest unit of work Spark executes.

Each stage consists of many tasks.
Each task processes one partition.

Rule:

Number of tasks = number of partitions

If your DataFrame has 200 partitions, that stage runs 200 tasks in parallel (subject to available cores).
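
You can check the partition count (and therefore the task count for the scan stage) directly. A minimal sketch, assuming an active SparkSession named spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=200)

# 200 partitions -> the stage that scans this DataFrame runs 200 tasks
print(df.rdd.getNumPartitions())   # 200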


7. Partitions: The Heart of Parallelism

A partition is a chunk of data processed by one task.

When Spark reads data:

spark.read.parquet("s3://bronze/orders/")

Partitions are determined by:

  • File count
  • File size
  • spark.sql.files.maxPartitionBytes
  • Existing partitioning in S3

Example:

  • 100 files → ~100 partitions
  • 1 TB single file → split into many partitions
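
If large files produce too few input partitions, you can lower spark.sql.files.maxPartitionBytes (128 MB by default) so Spark splits them further. A hedged sketch; the S3 path is the hypothetical one from the example above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap each input partition at 64 MB so big files are split into more partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

df = spark.read.parquet("s3://bronze/orders/")   # hypothetical path from the example
print(df.rdd.getNumPartitions())                 # roughly total input size / 64 MB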

8. Partition → Task → Core Mapping

The execution flow is:

Partition → Task → Executor Core

If you have:

  • 8 executors
  • 4 cores per executor
  • Total 32 cores

Spark can run:

  • 32 tasks at the same time

If your stage has:

  • 320 tasks → 10 waves of execution

This is why partition count directly affects performance.
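
The wave count is just arithmetic, which you can sanity-check in plain Python:

import math

executors = 8
cores_per_executor = 4
total_cores = executors * cores_per_executor       # 32 task slots running at once

tasks_in_stage = 320
waves = math.ceil(tasks_in_stage / total_cores)    # 320 / 32 = 10 waves
print(total_cores, waves)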


9. How Tasks Run Inside a Stage

Inside a stage:

  • All tasks run the same logic
  • Each task processes different partition data
  • Tasks are independent
  • No communication between tasks

If one task is slow (data skew), the whole stage waits.

This is why skewed partitions kill performance.
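
A quick way to spot skew before it bites is to count rows per key and look for outliers. A small sketch, assuming an active SparkSession named spark and a country column like the example above (the in-memory data is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [("US",), ("US",), ("US",), ("US",), ("IN",), ("DE",)],
    ["country"],
)

# Keys with far more rows than the rest end up in oversized shuffle partitions,
# so the task handling them becomes the straggler that holds back the stage
orders.groupBy("country").count().orderBy(F.desc("count")).show()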


10. Visualizing the Execution

Think of it like this:

Job
├── Stage 0 (100 tasks)
│   ├── Task 1 (Partition 1)
│   ├── Task 2 (Partition 2)
│   └── ...
├── Stage 1 (200 tasks - shuffle)
│   ├── Task 1
│   └── ...
└── Stage 2 (100 tasks - write)

11. Repartition vs Coalesce: Stage Impact

repartition()

  • Causes shuffle
  • Creates new stage
  • Even distribution

coalesce()

  • No shuffle
  • No new stage
  • Merges partitions only

Example:

df_even = df.repartition(200)   # shuffle -> a new stage when the next action runs
df_fewer = df.coalesce(50)      # no shuffle -> stays in the same stage
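
A common use of coalesce() is shrinking the partition count right before a write so you do not produce hundreds of tiny output files. A hedged sketch; the session setup, generated data, and output path are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 200 input partitions -> without coalesce the write would emit ~200 files
df = spark.range(0, 10_000_000, numPartitions=200)

# Merge down to 10 partitions with no shuffle and no extra stage,
# so the write produces about 10 files (output path is illustrative)
df.coalesce(10).write.mode("overwrite").parquet("/tmp/orders_out/")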

12. Real-World Example (AWS / Databricks)

Suppose:

  • Bronze S3 data: 500 partitions
  • Silver layer: heavy joins
  • Gold layer: aggregations

Typical stage flow:

  1. Read Bronze (Stage 0)
  2. Filter & enrich (Stage 1)
  3. Join dimension tables (Stage 2 – shuffle)
  4. Aggregate metrics (Stage 3 – shuffle)
  5. Write Gold (Stage 4)

👉 Understanding this helps you:

  • Tune partition size
  • Avoid unnecessary shuffles
  • Optimize cluster cost
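
One concrete lever for the join step: if the dimension table is small, broadcasting it avoids shuffling the large fact table at all. A hedged sketch; the table names and columns are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame([(1, "US"), (2, "IN")], ["order_id", "country_code"])
countries = spark.createDataFrame(
    [("US", "United States"), ("IN", "India")],
    ["country_code", "country_name"],
)

# broadcast() ships the small dimension table to every executor, so the big
# side is joined in place: no shuffle of the fact table, no extra stage for it
enriched = orders.join(broadcast(countries), "country_code")
enriched.explain()   # look for BroadcastHashJoin instead of SortMergeJoin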

13. How to See Stages and Tasks

Use Spark UI:

  • Jobs tab → number of jobs
  • Stages tab → shuffle boundaries
  • Tasks tab → skew, duration, failures
  • SQL tab → logical vs physical plan

Or use:

df.explain(True)
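
If you are not sure where the Spark UI is running (for example on a local session), the driver exposes its address; assuming an active SparkSession named spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# URL of the live Spark UI for this application (None if the UI is disabled)
print(spark.sparkContext.uiWebUrl)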

14. Key Performance Rules to Remember

Concept             | Rule
Statement           | Defines logic, not execution
Action              | Triggers a job
Shuffle             | Creates a new stage
Stage               | Runs tasks in parallel
Task                | Processes one partition
Partition           | Controls parallelism
Too many partitions | Overhead
Too few partitions  | Underutilization

15. Final Summary

When you write PySpark code, Spark transforms it into:

  1. Statements → logical plan
  2. Action → job
  3. Shuffle boundaries → stages
  4. Partitions → tasks
  5. Tasks → executor cores
