How to handle data skewness in Databricks
In PySpark, salting is a simple trick used to fix a problem called data skew.
What is skewness in Databricks?
Data skew happens when some values in a column show up a lot more than others.
Because of this, some parts of the data become too big, and certain nodes end up doing more work than the rest. This can slow everything down.
In distributed systems, data is often divided into partitions based on one or more keys (e.g., in a join or group-by operation).
If one key has far more records than the others (a skewed key), the partition containing that key becomes overloaded. This creates data skew and results in:
- Some partitions are large, causing slower computation.
- Uneven distribution of data, leading to suboptimal parallelism.
Salting solves this problem by artificially distributing data for the skewed keys across multiple partitions, resulting in better load balancing.
What is the salting trick?
Salting involves adding a random or deterministic “salt” value, typically a small number or a hash, to the key.
This causes the data to be spread more evenly across partitions. For example, with salt values 0 to 9, the hot key 123 becomes ten distinct keys, 123-0 through 123-9.
How to apply salting?
- Add a Salt Column: Create a new “salted” version of the key by appending a small random number or applying another transformation (e.g., a hash). This effectively creates more unique keys from the skewed key.
- Group or Join on the Salted Key: Execute the join or group operation using the salted key instead of the original key. This spreads the skewed data across multiple partitions.
- Re-aggregate (if necessary): After the join or aggregation, remove the salt. Recombine the results if the final output should show the original keys.
Real-world example
Let’s say you have a transactions DataFrame with customer_id as a key, and one customer (e.g., customer_id = 123) has a disproportionate number of records, causing data skew.
Step 1: Add a Salt Column
First, add a salt column.
from pyspark.sql import functions as F
# Create a new column 'salt' by adding a random integer
salted_df = df.withColumn("salt", F.expr("cast(rand() * 10 as int)"))
# Create a salted key by concatenating the 'customer_id' and 'salt'
salted_df = salted_df.withColumn("salted_customer_id", F.concat(F.col("customer_id"), F.lit("-"), F.col("salt")))
In this example, we add a salt by generating a random integer between 0 and 9. Then, we concatenate it with customer_id. This results in a new salted key salted_customer_id.
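If you want to sanity-check the effect, a quick illustrative count (reusing the salted_df from above, and assuming customer_id = 123 is the hot key from our example) shows the hot key spread across up to ten salted keys:
# Quick check: the hot key should now appear as up to 10 salted keys of similar size
salted_df.filter(F.col("customer_id") == 123).groupBy("salted_customer_id").count().show()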
Step 2: Perform Join or Group-by on Salted Keys
Second, when performing a join or group-by, you can now use salted_customer_id instead of the original customer_id:
# Example group-by operation using the salted key
result = salted_df.groupBy("salted_customer_id").agg(F.sum("amount").alias("total_amount"))
Step 3: (Optional) Remove Salt
Lastly, once you’ve completed the operation (e.g., aggregation), you might want to remove the salt to return to the original key:
# Remove salt and get back to the original customer_id
final_result = result.withColumn("customer_id", F.split(F.col("salted_customer_id"), "-").getItem(0))
Here, we extract the original customer_id by taking the first part of salted_customer_id. Note that split returns a string, so cast the column back if the original key was numeric.
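One caveat: because each customer’s rows were scattered across up to ten salt buckets, the result above holds partial sums (one row per salted key). To finish the “Re-aggregate (if necessary)” step from earlier, combine them back into one row per customer:
# Combine the per-bucket partial sums into one total per customer
final_result = final_result.groupBy("customer_id").agg(F.sum("total_amount").alias("total_amount"))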
When to Use Salting
- Data Skew in Joins: You are joining two datasets, and one or more keys in one of them have significantly more records than the rest. Salting spreads the overloaded key across several partitions (see the join sketch after this list).
- Data Skew in GroupBy: When performing aggregation or grouping operations on skewed keys, salting helps. It prevents one partition from holding a disproportionate number of records.
- Skewed Data in Sorting or Window Functions: Skewed data causes sorting and window functions to take longer, because the partition holding the hot key dominates the runtime.
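For the join case, the extra twist is that the salt must exist on both sides: the skewed side gets one random salt per row, while the other side is replicated once per salt value so every salted key finds its match. Below is a minimal sketch; transactions and customers are hypothetical DataFrames, and NUM_SALTS is an assumed tuning knob you would pick based on the skew.
from pyspark.sql import functions as F
NUM_SALTS = 10  # assumed value; must match the salt range used on the skewed side
# Skewed side: assign one random salt per row
salted_tx = transactions.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
# Other side: replicate each row once per salt value so every salted key has a match
salted_customers = customers.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)])))
# Join on the original key plus the salt, then drop the helper column
joined = salted_tx.join(salted_customers, on=["customer_id", "salt"]).drop("salt")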
Example Scenario Without Salting
Imagine a dataset in which 90% of the transactions belong to customer_id = 123. If you execute a groupBy(customer_id) operation without salting, one partition will contain the majority of the data while the other partitions have much less work, and this imbalance leads to slow query performance.
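Before reaching for salting, it is worth confirming the imbalance. A quick illustrative check on the df from earlier is to count rows per key and look for outliers:
# Spot hot keys: a skewed key will dominate the top of this list
df.groupBy("customer_id").count().orderBy(F.desc("count")).show(5)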
Drawbacks of Salting
- Extra Steps: You need to generate and later remove the salt, which adds complexity.
- Increased Data Size: Salting can increase the size of the dataset, since each key is expanded by the salt value (and, in joins, the smaller side is replicated once per salt).
- Overhead of Multiple Keys: Introducing more distinct keys gives Spark extra shuffle and bookkeeping work to manage, which can be an issue in some cases.
Conclusion
In short, salting is a useful way to fix data skew in systems like PySpark.
When data is not split evenly, it can slow things down.
By adding a salt column (a small random value), we can spread the data more evenly.
This helps all nodes do a similar amount of work, which makes the job run faster and smoother.