A PySpark sample program that shows how to drop columns whose NULL percentage exceeds a given threshold. We explain each step along with the expected result.

Drop High-Null-Value Columns in PySpark

Drop a Column That Has More NULLs Than the Threshold

The code aims to find columns with more than 30% null values and drop them from the DataFrame. Let’s go through each part of the code in detail to understand what’s happening:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Test").getOrCreate()

data = [("Name1", 20), ("Name2", 30), ("Name3", 40), ("Name3", None), ("Name4", None)]
columns = ("Empname", "Age")

df = spark.createDataFrame(data, columns)

# Drop columns whose null percentage exceeds the threshold
threshold = 0.3  # at most 30 percent nulls allowed in a column
total_rows = df.count()
# Get null percentage for each column
null_percentage = df.select([(F.count(F.when(F.col(c).isNull(), c))/total_rows).alias(c) for c in df.columns])

null_percentage.show()

cols_to_drop = [col for col in null_percentage.columns if null_percentage.first()[col] > threshold]

# The Age column has more than 30 percent nulls, so it will be dropped from the final result
df = df.drop(*cols_to_drop)

df.show()

Step-by-step Breakdown

data = [("Name1", 20), ("Name2", 30), ("Name3", 40), ("Name3", None), ("Name4", None)]
columns = ("Empname", "Age")
df = spark.createDataFrame(data, columns)

You created a DataFrame df with two columns, Empname and Age. The Age column has two None values (nulls).

DataFrame df:

+-------+----+
|Empname| Age|
+-------+----+
|  Name1|  20|
|  Name2|  30|
|  Name3|  40|
|  Name3|null|
|  Name4|null|
+-------+----+

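If you print the schema, you can confirm what Spark inferred: string for Empname and long for Age, both nullable:

df.printSchema()
# root
#  |-- Empname: string (nullable = true)
#  |-- Age: long (nullable = true)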

Defining the Threshold:

threshold = 0.3 # 30% null values allowed in a column
total_rows = df.count()

You set the null threshold to 30%: columns with a null percentage greater than 30% will be dropped. You also calculated the total number of rows with df.count(), which is 5 here, so any column with two or more nulls (2/5 = 0.4) exceeds the threshold.

Calculating the Null Percentage:

null_percentage = df.select([(F.count(F.when(F.col(c).isNull(), c)) / total_rows).alias(c) for c in df.columns])

This line of code calculates the percentage of null values for each column:

  • F.when(F.col(c).isNull(), c) returns the column name (as a literal) when the value in column c is null, and null otherwise.
  • F.count(F.when(...)) counts only the non-null results of that expression, which is exactly the number of nulls in column c.
  • Dividing this count by total_rows gives the null percentage for column c.

null_percentage DataFrame Output:

+-------+----+
|Empname| Age|
+-------+----+
| 0.0 | 0.4|
+-------+----+

Here:

  • Empname has 0% null values (0.0).
  • Age has 40% null values (0.4).
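As a side note, an equivalent and slightly more compact way to get the same percentages is to average a 0/1 null indicator per column; this also avoids the separate count() action. A minimal sketch using the same df:

null_percentage = df.select([F.avg(F.col(c).isNull().cast("int")).alias(c) for c in df.columns])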

Identifying Columns to Drop:

cols_to_drop = [col for col in null_percentage.columns if null_percentage.first()[col] > threshold]

This line creates a list of columns to drop:

  • It iterates over each column in null_percentage.columns.
  • For each column col, it checks if the percentage of nulls (null_percentage.first()[col]) is greater than the threshold (0.3).

In this case, the “Age” column has a null percentage of 0.4, which is greater than the threshold (0.3). So, “Age” is added to the list cols_to_drop.
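One note on efficiency: each call to null_percentage.first() inside the comprehension triggers a separate Spark job, one per column. A small optimization sketch (the variable name percentages is our own) collects the one-row result once and filters it locally:

percentages = null_percentage.first().asDict()
cols_to_drop = [c for c, pct in percentages.items() if pct > threshold]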

Dropping Columns:

df = df.drop(*cols_to_drop)

The drop(*cols_to_drop) method drops all columns listed in cols_to_drop. In this case, “Age” is dropped.
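Note that drop() with an empty argument list is a no-op, so this call is safe even when no column crosses the threshold. If you prefer, the same result can be expressed positively by selecting the surviving columns, as in this sketch:

df = df.select([c for c in df.columns if c not in cols_to_drop])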

Displaying the Result:

df.show()

The final DataFrame only contains the Empname column because “Age” was dropped due to exceeding the 30% null threshold.

Final DataFrame Output:

+-------+
|Empname|
+-------+
|  Name1|
|  Name2|
|  Name3|
|  Name3|
|  Name4|
+-------+

Summary

The logic in your code:

  • Calculates the percentage of null values for each column.
  • Drops columns where the null percentage exceeds the defined threshold of 30%.
  • Displays the final DataFrame without the dropped columns.

This approach helps you clean up a DataFrame by automatically removing columns with a high percentage of missing data, which is often an essential step in data preprocessing.
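If you want to reuse this logic across projects, the whole flow fits in a small helper function. Here is a minimal sketch; the name drop_high_null_columns is our own, not a built-in:

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def drop_high_null_columns(df: DataFrame, threshold: float = 0.3) -> DataFrame:
    # Null fraction per column: the average of a 0/1 null indicator
    fractions = df.select([F.avg(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]).first().asDict()
    # Drop every column whose null fraction exceeds the threshold
    return df.drop(*[c for c, frac in fractions.items() if frac > threshold])

Calling drop_high_null_columns(df, 0.3) on the sample DataFrame returns the same single-column result as the step-by-step code above.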