When you write a few lines of PySpark code, Spark executes a complex distributed workflow behind the scenes. Many data engineers know how to write PySpark, but fewer truly understand how statements become stages, stages become tasks, and tasks run on partitions.

This blog demystifies the internal execution model of Spark by connecting these five concepts:

Statement → Job → Stage → Task → Partition

Once you understand this flow, you can:

  • Debug slow jobs faster
  • Optimize performance confidently
  • Explain Spark execution in interviews
  • Design better data pipelines on Databricks or AWS EMR

1. What Is a PySpark Statement?

A PySpark statement is a single line or logical block of PySpark code that performs an operation on a DataFrame or RDD.

Example:

df = spark.read.parquet("s3://bronze/orders/")
df_filtered = df.filter("status = 'COMPLETE'")
df_grouped = df_filtered.groupBy("country").count()
df_grouped.write.mode("overwrite").parquet("s3://gold/orders_by_country/")

Each line looks simple, but Spark classifies them into two types of operations:

🔹 Transformations

These create a new DataFrame without executing immediately.

Examples:

  • select()
  • filter()
  • withColumn()
  • join()
  • groupBy()

🔹 Actions

These trigger execution.

Examples:

  • write()
  • count()
  • show()
  • collect()

👉 Spark does nothing until it hits an action. This is called lazy evaluation.
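
Here is a minimal sketch of lazy evaluation you can run locally (the SparkSession setup and the generated data are illustrative assumptions, not the S3 pipeline above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)                         # transformation: nothing runs yet
even = df.filter("id % 2 = 0")                      # transformation: still nothing runs
doubled = even.withColumn("twice", even["id"] * 2)  # still lazy

print(doubled.count())                              # action: Spark now plans and runs a job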


2. From Statement to Spark Job

A Spark job is created only when an action is executed.

In our example:

df_grouped.write.parquet(...)

This line triggers one Spark job.

If you later run:

df_grouped.count()

Spark creates another job.

Rule:

Each action = one Spark job

You can see jobs in:

  • Spark UI → Jobs tab
  • Databricks job run UI
  • EMR Spark History Server
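
Here is a small runnable sketch of that rule (the local session and generated data are illustrative assumptions, not the S3 pipeline above). Each action shows up as its own entry in the Jobs tab:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("jobs-demo").getOrCreate()

orders = spark.range(100_000).withColumnRenamed("id", "order_id")
complete = orders.filter("order_id % 3 = 0")   # transformation: no job yet

complete.count()    # action -> triggers a job
complete.show(5)    # another action -> triggers another job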

3. How Spark Breaks a Job into Stages

Once a job is triggered, Spark analyzes the entire logical plan and splits it into stages.

What is a Stage?

A stage is a set of tasks that can be executed in parallel without requiring data to move between executors.

Spark creates a new stage whenever a shuffle is required.


4. The Shuffle: The Reason Stages Exist

Some operations require data to be redistributed across executors:

Operations that cause shuffle:

  • groupBy()
  • join() (non-broadcast)
  • distinct()
  • orderBy()
  • repartition()

Example:

df.groupBy("country").count()

Spark must:

  • Send all rows with country=US to one executor
  • Send all rows with country=IN to another
  • And so on for every country

This network movement is expensive and creates a stage boundary.
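
You can see that boundary in the physical plan: the Exchange operator marks where Spark shuffles and starts a new stage. A minimal sketch, assuming an active SparkSession named spark and a tiny in-memory DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [("US", 10.0), ("IN", 5.0), ("US", 7.5)],
    ["country", "amount"],
)

# The printed plan typically contains an "Exchange hashpartitioning(country, ...)"
# node, which is exactly the shuffle / stage boundary described above
orders.groupBy("country").count().explain()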


5. Example: Stages in a Simple PySpark Job

df = spark.read.parquet("s3://bronze/orders/")
df_filtered = df.filter("status = 'COMPLETE'")
df_grouped = df_filtered.groupBy("country").count()
df_grouped.write.parquet("s3://gold/orders_by_country/")

Stage breakdown:

Stage   | What Happens
Stage 0 | Read Parquet and filter rows
Stage 1 | Shuffle data by country and aggregate
Stage 2 | Write output to S3

👉 Even though you wrote 4 statements, Spark created 3 stages.


6. What Is a Task?

A task is the smallest unit of work Spark executes.

Each stage consists of many tasks.
Each task processes one partition.

Rule:

Number of tasks = number of partitions

If your DataFrame has 200 partitions, that stage runs 200 tasks in parallel (subject to available cores).
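
You can check the partition count (and therefore the task count for the scan stage) directly. A minimal sketch, assuming an active SparkSession named spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=200)

# 200 partitions -> the stage that scans this DataFrame runs 200 tasks
print(df.rdd.getNumPartitions())   # 200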


7. Partitions: The Heart of Parallelism

A partition is a chunk of data processed by one task.

When Spark reads data:

spark.read.parquet("s3://bronze/orders/")

Partitions are determined by:

  • File count
  • File size
  • spark.sql.files.maxPartitionBytes
  • Existing partitioning in S3

Example:

  • 100 files → ~100 partitions
  • 1 TB single file → split into many partitions
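
If large files produce too few input partitions, you can lower spark.sql.files.maxPartitionBytes (128 MB by default) so Spark splits them further. A hedged sketch; the S3 path is the hypothetical one from the example above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap each input partition at 64 MB so big files are split into more partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

df = spark.read.parquet("s3://bronze/orders/")   # hypothetical path from the example
print(df.rdd.getNumPartitions())                 # roughly total input size / 64 MB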

8. Partition → Task → Core Mapping

The execution flow is:

Partition → Task → Executor Core

If you have:

  • 8 executors
  • 4 cores per executor
  • Total 32 cores

Spark can run:

  • 32 tasks at the same time

If your stage has:

  • 320 tasks → 10 waves of execution

This is why partition count directly affects performance.
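
The wave count is just arithmetic, which you can sanity-check in plain Python:

import math

executors = 8
cores_per_executor = 4
total_cores = executors * cores_per_executor       # 32 task slots running at once

tasks_in_stage = 320
waves = math.ceil(tasks_in_stage / total_cores)    # 320 / 32 = 10 waves
print(total_cores, waves)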


9. How Tasks Run Inside a Stage

Inside a stage:

  • All tasks run the same logic
  • Each task processes different partition data
  • Tasks are independent
  • No communication between tasks

If one task is slow (data skew), the whole stage waits.

This is why skewed partitions kill performance.
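
A quick way to spot skew before it bites is to count rows per key and look for outliers. A small sketch, assuming an active SparkSession named spark and a country column like the example above (the in-memory data is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [("US",), ("US",), ("US",), ("US",), ("IN",), ("DE",)],
    ["country"],
)

# Keys with far more rows than the rest end up in oversized shuffle partitions,
# so the task handling them becomes the straggler that holds back the stage
orders.groupBy("country").count().orderBy(F.desc("count")).show()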


10. Visualizing the Execution

Think of it like this:

Job
├── Stage 0 (100 tasks)
│   ├── Task 1 (Partition 1)
│   ├── Task 2 (Partition 2)
│   └── ...
├── Stage 1 (200 tasks - shuffle)
│   ├── Task 1
│   └── ...
└── Stage 2 (100 tasks - write)

11. Repartition vs Coalesce: Stage Impact

repartition()

  • Causes shuffle
  • Creates new stage
  • Even distribution

coalesce()

  • No shuffle
  • No new stage
  • Merges partitions only

Example:

df_even = df.repartition(200)   # shuffle -> a new stage when the next action runs
df_fewer = df.coalesce(50)      # no shuffle -> stays in the same stage
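
A common use of coalesce() is shrinking the partition count right before a write so you do not produce hundreds of tiny output files. A hedged sketch; the session setup, generated data, and output path are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 200 input partitions -> without coalesce the write would emit ~200 files
df = spark.range(0, 10_000_000, numPartitions=200)

# Merge down to 10 partitions with no shuffle and no extra stage,
# so the write produces about 10 files (output path is illustrative)
df.coalesce(10).write.mode("overwrite").parquet("/tmp/orders_out/")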

12. Real-World Example (AWS / Databricks)

Suppose:

  • Bronze S3 data: 500 partitions
  • Silver layer: heavy joins
  • Gold layer: aggregations

Typical stage flow:

  1. Read Bronze (Stage 0)
  2. Filter & enrich (Stage 1)
  3. Join dimension tables (Stage 2 – shuffle)
  4. Aggregate metrics (Stage 3 – shuffle)
  5. Write Gold (Stage 4)

👉 Understanding this helps you:

  • Tune partition size
  • Avoid unnecessary shuffles
  • Optimize cluster cost
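
One concrete lever for the join step: if the dimension table is small, broadcasting it avoids shuffling the large fact table at all. A hedged sketch; the table names and columns are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame([(1, "US"), (2, "IN")], ["order_id", "country_code"])
countries = spark.createDataFrame(
    [("US", "United States"), ("IN", "India")],
    ["country_code", "country_name"],
)

# broadcast() ships the small dimension table to every executor, so the big
# side is joined in place: no shuffle of the fact table, no extra stage for it
enriched = orders.join(broadcast(countries), "country_code")
enriched.explain()   # look for BroadcastHashJoin instead of SortMergeJoin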

13. How to See Stages and Tasks

Use Spark UI:

  • Jobs tab → number of jobs
  • Stages tab → shuffle boundaries
  • Tasks tab → skew, duration, failures
  • SQL tab → logical vs physical plan

Or use:

df.explain(True)
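
If you are not sure where the Spark UI is running (for example on a local session), the driver exposes its address; assuming an active SparkSession named spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# URL of the live Spark UI for this application (None if the UI is disabled)
print(spark.sparkContext.uiWebUrl)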

14. Key Performance Rules to Remember

Concept             | Rule
Statement           | Defines logic, not execution
Action              | Triggers a job
Shuffle             | Creates a new stage
Stage               | Runs tasks in parallel
Task                | Processes one partition
Partition           | Controls parallelism
Too many partitions | Overhead
Too few partitions  | Underutilization

15. Final Summary

When you write PySpark code, Spark transforms it into:

  1. Statements → logical plan
  2. Action → job
  3. Shuffle boundaries → stages
  4. Partitions → tasks
  5. Tasks → executor cores
