To count the number of NULL values in each column of a PySpark DataFrame, you can use the isNull() method. isNull() flags each value that is NULL, and aggregating those flags with sum() (via select or agg) gives a per-column count.

PySpark: Counting NULLs in Each Column

Counting NULL Values in Each Column

Let’s assume you have a DataFrame called df with some NULL values, and you want to count the number of NULL values in each column:

from pyspark.sql import SparkSession
# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum
from pyspark.sql.functions import col, sum

# Initialize Spark session
spark = SparkSession.builder.appName("CountNullsInColumns").getOrCreate()

# Example DataFrame with some NULL values
data = [("A", 1, None), ("B", None, 2), (None, 3, 3), ("C", 4, None), ("D", None, None)]
columns = ["col1", "col2", "col3"]

df = spark.createDataFrame(data, columns)

# Count NULLs in each column
null_counts = df.select([sum(col(c).isNull().cast("int")).alias(c) for c in df.columns])

# Show the result
null_counts.show()

Explanation

  1. col(c).isNull(): For each column c, checks whether the value is NULL.
  2. cast("int"): Converts the boolean result. True represents NULL. False represents not NULL. The result is converted to an integer: 1 for True and 0 for False.
  3. sum(): Aggregates the 1s to count the number of NULL values in each column.
  4. alias(c): Renames the resulting column to match the original column name.
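To make the list comprehension concrete, here is the same select written out explicitly for the three example columns (this is exactly what the comprehension expands to for this DataFrame):

# Equivalent to the list comprehension, expanded for the three example columns
null_counts = df.select(
    sum(col("col1").isNull().cast("int")).alias("col1"),
    sum(col("col2").isNull().cast("int")).alias("col2"),
    sum(col("col3").isNull().cast("int")).alias("col3"),
)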

Output

Running the above code will produce an output similar to this:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   3|
+----+----+----+

This output shows the count of NULL values for each column in the DataFrame.

Explanation in Detail

  • select([ ... ]): Uses a list comprehension to build one NULL-counting expression per column.
  • null_counts: The resulting single-row DataFrame, holding the NULL count for each original column.
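As a variation, the same counts can be computed with count() combined with when(), which avoids the cast because count() only counts non-NULL values. This is a sketch that reuses the df defined above:

from pyspark.sql.functions import count, when

# when() without otherwise() yields NULL for rows that do not match,
# and count() skips NULLs, so this counts exactly the rows where each column is NULL
null_counts_alt = df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts_alt.show()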

Summary

This method counts the NULL values in every column of a PySpark DataFrame in a single pass over the data, and the list comprehension scales naturally to DataFrames with many columns, so you can quickly see where data is missing.
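If you want to act on the result in Python, for example to list only the columns that contain missing data, one option (a minimal sketch that reuses the null_counts DataFrame from above; the variable names are illustrative) is to collect the single result row into a dictionary:

# Pull the single-row result into a plain Python dict: {column name: NULL count}
counts = null_counts.first().asDict()

# Columns with at least one NULL value
columns_with_nulls = [name for name, n in counts.items() if n > 0]
print(columns_with_nulls)  # for the example data: ['col1', 'col2', 'col3']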