To count the number of NULL values in each column of a PySpark DataFrame, you can use the isNull() method. isNull() flags each value that is NULL, and aggregating those flags with sum() (via select or agg) gives a per-column count.

PySpark: Counting NULLs in Each Column

Counting NULL Values in Each Column

Let’s assume you have a DataFrame called df with some NULL values, and you want to count the number of NULL values in each column:

from pyspark.sql import SparkSession
# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum
from pyspark.sql.functions import col, sum

# Initialize Spark session
spark = SparkSession.builder.appName("CountNullsInColumns").getOrCreate()

# Example DataFrame with some NULL values
data = [("A", 1, None), ("B", None, 2), (None, 3, 3), ("C", 4, None), ("D", None, None)]
columns = ["col1", "col2", "col3"]

df = spark.createDataFrame(data, columns)

# Count NULLs in each column
null_counts = df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])

# Show the result
null_counts.show()

Explanation

  1. col(c).isNull(): For each column c, checks whether the value is NULL.
  2. cast("int"): Converts the boolean result. True represents NULL. False represents not NULL. The result is converted to an integer: 1 for True and 0 for False.
  3. sum(): Aggregates the 1s to count the number of NULL values in each column.
  4. alias(c): Renames the resulting column to match the original column name.
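To make the list comprehension concrete, here is the same select written out explicitly for the three example columns (this is exactly what the comprehension expands to for this DataFrame):

# Equivalent to the list comprehension, expanded for the three example columns
null_counts = df.select(
    sum(col("col1").isNull().cast("int")).alias("col1"),
    sum(col("col2").isNull().cast("int")).alias("col2"),
    sum(col("col3").isNull().cast("int")).alias("col3"),
)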

Output

Running the above code will produce an output similar to this:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   3|
+----+----+----+

This output shows the count of NULL values for each column in the DataFrame.

Explanation in Detail

  • select([ ... ]): Uses a list comprehension to build one NULL-counting expression per column.
  • null_counts: The resulting single-row DataFrame, holding the NULL count for each original column.
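As a variation, the same counts can be computed with count() combined with when(), which avoids the cast because count() only counts non-NULL values. This is a sketch that reuses the df defined above:

from pyspark.sql.functions import count, when

# when() without otherwise() yields NULL for rows that do not match,
# and count() skips NULLs, so this counts exactly the rows where each column is NULL
null_counts_alt = df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts_alt.show()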

Summary

This method counts the NULL values in every column of a PySpark DataFrame in a single pass over the data, and the list comprehension scales naturally to DataFrames with many columns, so you can quickly see where data is missing.
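If you want to act on the result in Python, for example to list only the columns that contain missing data, one option (a minimal sketch that reuses the null_counts DataFrame from above; the variable names are illustrative) is to collect the single result row into a dictionary:

# Pull the single-row result into a plain Python dict: {column name: NULL count}
counts = null_counts.first().asDict()

# Columns with at least one NULL value
columns_with_nulls = [name for name, n in counts.items() if n > 0]
print(columns_with_nulls)  # for the example data: ['col1', 'col2', 'col3']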