When working with Databricks, one of the most important yet often misunderstood concepts is how clusters use their memory, cores, and nodes to process big data efficiently. Whether you are tuning your Spark jobs or trying to estimate the right cluster size for a workload, understanding these components will help you optimize performance, control costs, and avoid bottlenecks.
In this article, we’ll explore:
- Understanding the Components of Databricks Clusters
- How memory, cores, and nodes are structured
- How Spark partitions and processes a dataset
- A step-by-step example: processing 100 GB of input data
1. Databricks Cluster Basics
A Databricks cluster is a set of computation resources and configurations that Spark uses to execute your jobs. It has two main parts:
- Driver node – Coordinates the execution of your Spark application, holds metadata, and manages the job DAG (Directed Acyclic Graph).
- Worker nodes – Do the heavy lifting. They store partitions of data in memory/disk and run tasks in parallel.
Think of the driver as the orchestra conductor and the workers as musicians — the driver plans and coordinates, while workers execute.
2. Memory, Cores, and Nodes: The Building Blocks
Each worker node in Databricks has:
- Cores (vCPUs) – Determine how many tasks the worker can run in parallel.
- Memory (RAM) – Stores data partitions and intermediate results.
- Local storage (SSD) – Used for shuffle spill or when memory is insufficient.
For example, a worker might have:
- 8 cores
- 64 GB RAM
- 1 TB SSD
If your cluster has 4 worker nodes, you now have:
- Total cores = 4 × 8 = 32 cores
- Total memory = 4 × 64 GB = 256 GB
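To make this arithmetic concrete, here is a minimal PySpark sketch that prints the cluster totals from this example and asks Spark how many task slots it actually sees. The worker counts are the assumed values from above, and getOrCreate() simply returns the notebook's existing session on Databricks.

```python
# Minimal sketch: cluster capacity math for the example above, plus a runtime check.
# The worker count, cores, and RAM are the assumed values from this article.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # returns the existing session in a Databricks notebook

workers = 4             # assumption: 4 worker nodes
cores_per_worker = 8    # assumption: 8 vCPUs per worker
ram_per_worker_gb = 64  # assumption: 64 GB RAM per worker

print("Total cores :", workers * cores_per_worker)          # 32
print("Total memory:", workers * ram_per_worker_gb, "GB")   # 256 GB

# Spark's own view of the parallelism it has to work with:
print("defaultParallelism:", spark.sparkContext.defaultParallelism)
```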
3. Spark Data Partitioning
Apache Spark (the engine under Databricks) processes data in partitions.
- A partition is the smallest unit of data that Spark operates on.
- Each partition is processed by one task on one core.
By default:
- In the Databricks Runtime, Spark targets roughly one partition per 128 MB of data (controlled by `spark.sql.files.maxPartitionBytes`; this may vary based on configuration).
- If your data is compressed, the in-memory size of a partition can be noticeably larger than its on-disk split size once the data is decompressed.
Why partitions matter:
- Too few partitions → cores idle, slower processing.
- Too many partitions → task overhead, excessive shuffle.
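If you want to see this for your own data, a small sketch like the following reads a dataset and prints both the configured target split size and the number of partitions Spark actually created; the input path is a hypothetical placeholder.

```python
# Sketch: check how Spark partitioned an input DataFrame.
# The input path is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Target split size used when reading files (128 MB by default):
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

df = spark.read.parquet("/mnt/raw/events/")   # hypothetical path
print("Input partitions:", df.rdd.getNumPartitions())
```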
4. The 100 GB Example: How It Works
Let’s assume:
- Input data = 100 GB (uncompressed)
- Partition size target = 128 MB
- Cluster has 4 workers, each with 8 cores and 64 GB RAM
Step 1: Partitioning the Data
100 GB / 128 MB = 800 partitions
Because:
100 GB = 100 × 1024 MB = 102,400 MB
102,400 / 128 = 800
So with a target split size of 128 MB (the default for `spark.sql.files.maxPartitionBytes`), Spark will create roughly 800 input partitions.
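The same arithmetic as plain Python, in case you want to adapt it to other dataset or split sizes:

```python
# The partition arithmetic as plain Python (no Spark needed).
data_size_mb = 100 * 1024    # 100 GB expressed in MB
target_split_mb = 128        # default spark.sql.files.maxPartitionBytes
print(data_size_mb // target_split_mb)   # 800 partitions
```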
Step 2: Assigning Tasks to Cores
Each partition is processed by one core at a time.
- Total cores in cluster = 4 workers × 8 cores = 32 cores
- This means 32 partitions can be processed in parallel.
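A quick sketch of the "waves" calculation; ceil() covers the general case where the partition count does not divide evenly by the core count:

```python
# Sketch: number of task "waves" needed to process all partitions.
import math

partitions = 800
total_cores = 4 * 8                            # 4 workers × 8 cores
print(math.ceil(partitions / total_cores))     # 25 waves
```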
Step 3: Stages of Processing
Spark breaks your job into stages based on transformations:
- Narrow transformations (like `map`, `filter`) → no shuffle between partitions.
- Wide transformations (like `groupBy`, `join`) → require a shuffle, which can spill data to disk if memory is insufficient.
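Here is a small, illustrative PySpark sketch that contrasts the two (the path and column names are assumptions); calling explain() on the result shows the Exchange (shuffle) that the wide groupBy introduces.

```python
# Sketch: a narrow vs. a wide transformation (path and columns are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/raw/events/")        # hypothetical path

# Narrow: each output partition depends on exactly one input partition, no shuffle.
active = df.filter(F.col("status") == "active")

# Wide: rows with the same key must be brought together, so Spark shuffles data.
counts = active.groupBy("country").count()

counts.explain()   # the physical plan shows an Exchange (shuffle) for the groupBy
```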
Step 4: Memory Usage
Each task needs memory for:
- The data in its partition
- Intermediate shuffle data
- Execution overhead (~10–15% of total memory)
If each worker has 64 GB RAM:
- Some RAM is reserved for Spark overhead.
- Let’s say ~50 GB is available for actual data.
- At any given time, the worker runs 8 tasks in parallel (one per core), so each task has roughly 50 GB ÷ 8 ≈ 6.25 GB of RAM available.
If a partition’s data + shuffle data > available RAM, Spark will spill to SSD.
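The per-task figure is simple division; here is a tiny sketch using the assumed 50 GB of usable memory from above:

```python
# Rough per-task memory estimate for one worker (the 50 GB usable figure
# is the assumption from the text above, not a measured value).
usable_memory_gb = 50
tasks_in_parallel = 8    # one task per core
print(usable_memory_gb / tasks_in_parallel, "GB per task")   # 6.25 GB per task
```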
Step 5: Execution Flow
- Read Phase
  - Spark reads the 800 partitions from the source (S3, ADLS, etc.).
  - 32 partitions are processed in parallel → ~25 waves to complete (800 / 32).
- Transformation Phase
  - Each partition is transformed in memory.
  - If the transformation is wide (e.g., a join), a shuffle happens across nodes.
- Shuffle Phase
  - Data is redistributed between workers.
  - Memory pressure during the shuffle can cause spilling.
- Write Phase
  - Output data is written back in partitions (and can be repartitioned before writing).
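Putting the four phases together, here is a hedged end-to-end sketch; the paths, column names, and the repartition count of 32 are illustrative assumptions, not recommendations.

```python
# End-to-end sketch of read → transform → shuffle → write.
# Paths, columns, and the repartition count are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/mnt/raw/events/")                  # read phase

daily = (
    df.filter(F.col("event_date") >= "2024-01-01")           # narrow transformation
      .groupBy("event_date", "country")                      # wide transformation → shuffle
      .agg(F.count("*").alias("events"))
)

(
    daily.repartition(32)                                    # control output partition count
         .write.mode("overwrite")
         .parquet("/mnt/curated/daily_events/")              # write phase
)
```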
5. How to Optimize Processing
a) Right-size partitions
- Too big → memory issues.
- Too small → overhead.
- Use `spark.sql.files.maxPartitionBytes` to control the target split size.
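For example, a sketch that raises the target split size to 256 MB for subsequent reads (the value itself is just an illustration):

```python
# Sketch: raise the target input split size to 256 MB for subsequent reads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.files.maxPartitionBytes", "256m")  # default is 128 MB
```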
b) Use caching wisely
- Cache only if reused multiple times.
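A minimal caching sketch, assuming a hypothetical path and a DataFrame that is reused by more than one action:

```python
# Sketch: cache only a DataFrame that is reused by several actions (path is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

reused = spark.read.parquet("/mnt/raw/events/").filter("status = 'active'")
reused.cache()            # lazy: materialized on the first action
print(reused.count())     # first action populates the cache
print(reused.count())     # later actions read from memory
reused.unpersist()        # free the memory when you are done
```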
c) Monitor shuffle
- Optimize joins using `broadcast()` for small tables.
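A sketch of a broadcast join hint, with hypothetical fact and lookup tables; the query plan should switch from a SortMergeJoin to a BroadcastHashJoin.

```python
# Sketch: broadcast the small lookup table so the big table is not shuffled.
# Table paths and the join key are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.read.parquet("/mnt/raw/events/")       # large fact table
dims  = spark.read.parquet("/mnt/raw/countries/")    # small lookup table

joined = facts.join(broadcast(dims), on="country_code", how="left")
joined.explain()   # expect BroadcastHashJoin instead of SortMergeJoin
```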
d) Choose cluster size wisely
- More cores = more parallelism.
- More RAM = fewer spills.
6. Summary Table: 100 GB Dataset Processing
| Item | Value |
|---|---|
| Dataset Size | 100 GB |
| Partition Size | 128 MB |
| Total Partitions | 800 |
| Workers | 4 |
| Cores per Worker | 8 |
| Total Cores | 32 |
| Parallel Tasks | 32 |
| Processing Waves | 25 (800 ÷ 32) |
| Memory per Worker | 64 GB |
| Memory per Task | ~6.25 GB |
Final Thoughts
In Databricks, the performance of your workload depends heavily on how partitions map to cores and how memory is managed. For a 100 GB dataset:
- Spark will split the data into 800 partitions.
- With 4 workers (8 cores each), you get 32 tasks running in parallel.
- The data will be processed in waves, with memory and shuffle behavior impacting performance.
Understanding this flow lets you fine-tune:
- Cluster size for speed vs. cost.
- Partition size for balanced execution.
- Memory usage to reduce spilling.
If you get these basics right, you’ll unlock the true parallel power of Databricks and Spark for your big data workloads.