As a new data engineer, one of the most common performance issues you’ll run into is data skew.
👉 Data skew happens when some partitions have way more data than others. This causes certain Spark tasks to take much longer, leading to slow pipelines, wasted resources, and even job failures.
But don’t worry—Spark UI in Databricks is your best friend to detect and fix this!
Here’s a step-by-step beginner guide to monitoring data skew using Spark UI:
🔹 What is Data Skew (Simply Put)?
Imagine you split your data into 10 buckets (partitions), and:
- 1 bucket has 9 million rows
- 9 buckets have only 100 rows each
That’s skew! Spark will process the small buckets super fast, but wait forever on the one overloaded bucket.
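👉 You don't even need the UI to spot this. Here's a minimal PySpark sketch (assuming `df` is any DataFrame you've already loaded) that counts rows per partition:

```python
from pyspark.sql import functions as F

# Tag every row with the ID of the partition it lives in, then
# count rows per partition. Healthy data shows similar counts;
# skew shows up as one huge outlier at the top.
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy(F.desc("count"))
   .show())
```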
🛠️ Step-by-Step: Detecting Skew in Spark UI
✅ 1. Run your job in Databricks
Trigger any transformation that involves a shuffle (e.g., a groupBy or join), followed by an action such as a write so the job actually executes.
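For example, a simple aggregation like this triggers a shuffle (the table and column names here are placeholders for your own data):

```python
# groupBy forces a shuffle: all rows with the same customer_id must
# land on the same task, so one hot customer creates one hot task.
orders = spark.table("orders")  # hypothetical table name
counts = orders.groupBy("customer_id").count()
counts.write.mode("overwrite").saveAsTable("orders_by_customer")
```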
✅ 2. Open the Spark UI
Once the job is running or completed:
- Go to the “Jobs” tab in Databricks
- Click on the job you’re analyzing
- Click into “Stages” to see details for each computation step
✅ 3. Look at Task Durations
Inside a stage, click “Tasks”
- Check the duration, input size, and shuffle read/write size
- If you see a few tasks with massively higher durations or data size, you have skew
👉 Example:
| Task | Duration | Input Size |
|---|---|---|
| 1 | 2s | 10 MB |
| 2 | 2s | 10 MB |
| 3 | 300s | 5 GB |
👈 That's skew right there: look at task 3!
🔍 Other Signs of Skew
- One task lags behind all others
- High memory or GC usage on one executor
- Shuffle data grows unexpectedly large
🎯 How to Fix It (Basic Ideas)
- Salting the key (for skewed joins):
- Add randomness to keys before joining, then strip the salt afterwards (see the salting sketch after this list).
- Filter heavy keys separately:
- Process skewed keys (like NULLs or 0s) in a different path.
- Use repartition() or coalesce():
- Helps balance data before heavy operations (see the repartition sketch after this list).
- Broadcast joins (if one side is small):
- Use broadcast(df) from pyspark.sql.functions to skip shuffles altogether (see the broadcast sketch after this list).
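Here's what salting can look like in practice. This is a sketch, not a drop-in fix: `large` (big and skewed) and `small` (its join partner) are hypothetical DataFrames, and SALT_BUCKETS is a knob you'd tune to how severe the skew is.

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 10  # assumption: tune to the severity of the skew

# Big (skewed) side: tag each row with a random salt 0..9, so rows
# sharing a hot key get spread across many partitions.
salted_large = large.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Small side: replicate each row once per salt value so every
# (key, salt) pair on the big side still finds its match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_small = small.crossJoin(salts)

# Join on the original key plus the salt; the hot key's work is
# now split across up to 10 tasks instead of 1.
joined = salted_large.join(salted_small, ["key", "salt"]).drop("salt")
```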
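And repartition() vs coalesce(), sketched against the same hypothetical data:

```python
# repartition(n) performs a full shuffle and spreads rows evenly;
# use it before a heavy, skew-sensitive operation.
balanced = large.repartition(200)

# Repartitioning by a column co-locates equal keys, but it can
# itself skew if one key dominates, so re-check the Spark UI after.
by_key = large.repartition(200, "key")

# coalesce(n) only merges existing partitions (no shuffle), so it's
# good for cheaply shrinking partition count, not for fixing skew.
fewer = large.coalesce(20)
```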
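Finally, the broadcast join. One gotcha: in PySpark the helper is the broadcast() function, not a DataFrame method. `orders` and `dim_products` are hypothetical tables, with dim_products small enough to fit in executor memory.

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor so the big table never
# shuffles; skew in orders.product_id stops mattering for this join.
joined = orders.join(broadcast(dim_products), "product_id")
```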
💡 Pro Tips for Newbies
- Use explain() in notebooks to see where Spark plans to shuffle (example right after this list).
- Start with small sample datasets to experiment with skew-handling techniques.
- Learn to interpret the DAG (Directed Acyclic Graph) in Spark UI to trace where skew happens.
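For instance, reusing the hypothetical orders table from earlier:

```python
# mode="formatted" (Spark 3+) prints a readable plan. Look for
# "Exchange" nodes: each one is a shuffle and a potential skew point.
orders.groupBy("customer_id").count().explain(mode="formatted")
```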
🚀 Why this matters:
Skew can silently kill your performance. Learning to catch it early and eliminate it makes your pipelines faster, cheaper, and more scalable.
🧠 As you grow, Spark UI will be your go-to toolbox to debug issues like:
- Memory bottlenecks
- Long-running stages
- Failed tasks