A PySpark sample program that shows how to drop columns that have more NULLs than a given threshold. Each step is explained with the expected result.

Drop Columns That Have More NULLs Than a Threshold
The code aims to find columns with more than 30% null values and drop them from the DataFrame. Let’s go through each part of the code in detail to understand what’s happening:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("Test").getOrCreate()

data = [("Name1", 20), ("Name2", 30), ("Name3", 40), ("Name3", None), ("Name4", None)]
columns = ("Empname", "Age")
df = spark.createDataFrame(data, columns)

# Drop columns whose percentage of NULL values exceeds the threshold
threshold = 0.3  # at most 30 percent NULLs allowed in a column
total_rows = df.count()

# Get the null percentage for each column
null_percentage = df.select([(F.count(F.when(F.col(c).isNull(), c)) / total_rows).alias(c) for c in df.columns])
null_percentage.show()

cols_to_drop = [col for col in null_percentage.columns if null_percentage.first()[col] > threshold]

# Since the Age column has more than 30 percent NULLs, it is dropped from the final result
df = df.drop(*cols_to_drop)
df.show()
Step-by-step Breakdown
data = [("Name1", 20), ("Name2", 30), ("Name3", 40), ("Name3", None), ("Name4", None)]
columns = ("Empname", "Age")
df = spark.createDataFrame(data, columns)
You created a DataFrame df with two columns, Empname and Age. The Age column has two None values (nulls).
DataFrame df:
| Empname | Age |
|---|---|
| Name1 | 20 |
| Name2 | 30 |
| Name3 | 40 |
| Name3 | null |
| Name4 | null |
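As a quick sanity check (not part of the original snippet), you can print the inferred schema; Spark infers Age as a nullable long (bigint) because the column mixes Python ints with None:

df.printSchema()
# root
#  |-- Empname: string (nullable = true)
#  |-- Age: long (nullable = true)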
Defining the Threshold:
threshold = 0.3 # 30% null values allowed in a column
total_rows = df.count()
You set the null threshold to 30%. Columns with a null percentage greater than 30% will be dropped. You also calculated the total number of rows using df.count(), which is 5 in this case.
Calculating the Null Percentage:
null_percentage = df.select([(F.count(F.when(F.col(c).isNull(), c)) / total_rows).alias(c) for c in df.columns])
This line of code calculates the percentage of null values for each column (an equivalent formulation is sketched after this breakdown):
- F.when(F.col(c).isNull(), c) returns the value c for rows where column c is null, and null for all other rows.
- F.count(F.when(...)) counts only the non-null results of that expression, which is exactly the number of rows where column c is null.
- Dividing this count by total_rows gives the null percentage for column c.
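A minimal equivalent sketch, assuming the same df and the F alias from the imports above: cast each isNull flag to an integer and sum it per column.

# Cast each row's isNull flag to 0/1 and sum per column to count nulls
null_counts = df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns])
null_counts.show()  # expected: Empname -> 0, Age -> 2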
null_percentage DataFrame Output:
+-------+---+
|Empname|Age|
+-------+---+
|    0.0|0.4|
+-------+---+
Here:
Empname has 0% null values (0.0).
Age has 40% null values (0.4).
Identifying Columns to Drop:
cols_to_drop = [col for col in null_percentage.columns if null_percentage.first()[col] > threshold]
This line creates a list of columns to drop:
- It iterates over each column in null_percentage.columns.
- For each column col, it checks if the percentage of nulls (null_percentage.first()[col]) is greater than the threshold (0.3).
In this case, the “Age” column has a null percentage of 0.4, which is greater than the threshold (0.3). So, “Age” is added to the list cols_to_drop.
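One small refinement, sketched with the same names: null_percentage.first() is evaluated once per column inside the comprehension, and each call is a separate Spark action. Collecting the single row into a plain Python dict first avoids the repeated work:

# Collect the one-row result once, then filter locally in Python
pct = null_percentage.first().asDict()   # {'Empname': 0.0, 'Age': 0.4}
cols_to_drop = [c for c, p in pct.items() if p > threshold]
print(cols_to_drop)                      # ['Age']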
Dropping Columns:
df = df.drop(*cols_to_drop)
The drop(*cols_to_drop) method drops all columns listed in cols_to_drop. In this case, “Age” is dropped.
Displaying the Result:
df.show()
The final DataFrame only contains the Empname column because “Age” was dropped due to exceeding the 30% null threshold.
Final DataFrame Output:
+-------+
|Empname|
+-------+
|  Name1|
|  Name2|
|  Name3|
|  Name3|
|  Name4|
+-------+
Summary
The logic in your code:
- Calculates the percentage of null values for each column.
- Drops columns where the null percentage exceeds the defined threshold of 30%.
- Displays the final DataFrame without the dropped columns.
This approach helps you clean up the DataFrame by automatically removing columns with a high percentage of missing data, which is often an essential step in data preprocessing.
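If you need this logic for more than one DataFrame, the steps can be wrapped in a small helper. This is a sketch under the same assumptions as the walkthrough; the function name drop_high_null_columns is illustrative, not a built-in:

import pyspark.sql.functions as F

def drop_high_null_columns(df, threshold=0.3):
    """Drop columns whose fraction of NULL values exceeds the threshold."""
    total_rows = df.count()
    if total_rows == 0:  # an empty DataFrame has nothing to measure
        return df
    # One-row DataFrame of per-column null fractions, collected as a dict
    pct = df.select([
        (F.count(F.when(F.col(c).isNull(), c)) / total_rows).alias(c)
        for c in df.columns
    ]).first().asDict()
    return df.drop(*[c for c, p in pct.items() if p > threshold])

cleaned = drop_high_null_columns(df, threshold=0.3)
cleaned.show()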