A PySpark sample program that shows how to drop columns whose NULL percentage exceeds a given threshold. We explain each step along with the expected result.

Drop High-Null-Value Columns in PySpark

Drop a Column That Has More NULLs Than the Threshold

The code aims to find columns with more than 30% null values and drop them from the DataFrame. Let’s go through each part of the code in detail to understand what’s happening:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Test").getOrCreate()

data = [("Name1", 20), ("Name2", 30), ("Name3", 40), ("Name3", None), ("Name4", None)]
columns = ("Empname", "Age")

df = spark.createDataFrame(data, columns)

# Drop columns whose null percentage exceeds the threshold
threshold = 0.3  # at most 30 percent nulls allowed in a column
total_rows = df.count()
# Get null percentage for each column
null_percentage = df.select([(F.count(F.when(F.col(c).isNull(), c))/total_rows).alias(c) for c in df.columns])

null_percentage.show()

cols_to_drop = [col for col in null_percentage.columns if null_percentage.first()[col] > threshold]

# The Age column has more than 30 percent nulls, so it will be dropped from the final result
df = df.drop(*cols_to_drop)

df.show()

Step-by-step Breakdown

data = [("Name1", 20), ("Name2", 30), ("Name3", 40), ("Name3", None), ("Name4", None)]
columns = ("Empname", "Age")
df = spark.createDataFrame(data, columns)

You created a DataFrame df with two columns, Empname and Age. The Age column has two None values (nulls).

DataFrame df:

+-------+----+
|Empname| Age|
+-------+----+
|  Name1|  20|
|  Name2|  30|
|  Name3|  40|
|  Name3|null|
|  Name4|null|
+-------+----+

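If you print the schema, you can confirm what Spark inferred: string for Empname and long for Age, both nullable:

df.printSchema()
# root
#  |-- Empname: string (nullable = true)
#  |-- Age: long (nullable = true)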

Defining the Threshold:

threshold = 0.3 # 30% null values allowed in a column
total_rows = df.count()

You set the null threshold to 30%: columns with a null percentage greater than 30% will be dropped. You also calculated the total number of rows with df.count(), which is 5 here, so any column with two or more nulls (2/5 = 0.4) exceeds the threshold.

Calculating the Null Percentage:

null_percentage = df.select([(F.count(F.when(F.col(c).isNull(), c)) / total_rows).alias(c) for c in df.columns])

This line of code calculates the percentage of null values for each column:

  • F.when(F.col(c).isNull(), c) returns the column name (as a literal) when the value in column c is null, and null otherwise.
  • F.count(F.when(...)) counts only the non-null results of that expression, which is exactly the number of nulls in column c.
  • Dividing this count by total_rows gives the null percentage for column c.

null_percentage DataFrame Output:

+-------+----+
|Empname| Age|
+-------+----+
| 0.0 | 0.4|
+-------+----+

Here:

  • Empname has 0% null values (0.0).
  • Age has 40% null values (0.4).
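As a side note, an equivalent and slightly more compact way to get the same percentages is to average a 0/1 null indicator per column; this also avoids the separate count() action. A minimal sketch using the same df:

null_percentage = df.select([F.avg(F.col(c).isNull().cast("int")).alias(c) for c in df.columns])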

Identifying Columns to Drop:

cols_to_drop = [col for col in null_percentage.columns if null_percentage.first()[col] > threshold]

This line creates a list of columns to drop:

  • It iterates over each column in null_percentage.columns.
  • For each column col, it checks if the percentage of nulls (null_percentage.first()[col]) is greater than the threshold (0.3).

In this case, the “Age” column has a null percentage of 0.4, which is greater than the threshold (0.3). So, “Age” is added to the list cols_to_drop.
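One note on efficiency: each call to null_percentage.first() inside the comprehension triggers a separate Spark job, one per column. A small optimization sketch (the variable name percentages is our own) collects the one-row result once and filters it locally:

percentages = null_percentage.first().asDict()
cols_to_drop = [c for c, pct in percentages.items() if pct > threshold]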

Dropping Columns:

df = df.drop(*cols_to_drop)

The drop(*cols_to_drop) method drops all columns listed in cols_to_drop. In this case, “Age” is dropped.
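Note that drop() with an empty argument list is a no-op, so this call is safe even when no column crosses the threshold. If you prefer, the same result can be expressed positively by selecting the surviving columns, as in this sketch:

df = df.select([c for c in df.columns if c not in cols_to_drop])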

Displaying the Result:

df.show()

The final DataFrame only contains the Empname column because “Age” was dropped due to exceeding the 30% null threshold.

Final DataFrame Output:

+-------+
|Empname|
+-------+
|  Name1|
|  Name2|
|  Name3|
|  Name3|
|  Name4|
+-------+

Summary

The logic in your code:

  • Calculates the percentage of null values for each column.
  • Drops columns where the null percentage exceeds the defined threshold of 30%.
  • Displays the final DataFrame without the dropped columns.

This approach helps you clean up a DataFrame by automatically removing columns with a high percentage of missing data, which is often an essential step in data preprocessing.
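If you want to reuse this logic across projects, the whole flow fits in a small helper function. Here is a minimal sketch; the name drop_high_null_columns is our own, not a built-in:

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def drop_high_null_columns(df: DataFrame, threshold: float = 0.3) -> DataFrame:
    # Null fraction per column: the average of a 0/1 null indicator
    fractions = df.select([F.avg(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).first().asDict()
    # Drop every column whose null fraction exceeds the threshold
    return df.drop(*[c for c, frac in fractions.items() if frac > threshold])

Calling drop_high_null_columns(df, 0.3) on the sample DataFrame returns the same single-column result as the step-by-step code above.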