In PySpark, you can handle NULL values using several functions that provide similar functionality to SQL. Below is an explanation of NULLIF, IFNULL, NVL, and NVL2, along with examples of how to use them in PySpark.

1. NULLIF

The NULLIF function returns NULL if two expressions are equal; otherwise, it returns the first expression. In PySpark, you can use a combination of conditional functions to replicate this behavior.

Usage:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("NullIfExample").getOrCreate()

data = [(1, 1), (2, 2), (3, 4), (4, 4)]
columns = ["a", "b"]
df = spark.createDataFrame(data, columns)

df = df.withColumn("NullIfResult", F.when(F.col("a") == F.col("b"), None).otherwise(F.col("a")))

df.show()

2. IFNULL

IFNULL returns the second argument if the first argument is NULL; otherwise, it returns the first argument. In PySpark, you can use the coalesce() function to achieve this.

Usage:

df = df.withColumn("IfNullResult", F.coalesce(F.col("a"), F.lit(0)))

df.show()

3. NVL

NVL is similar to IFNULL and is used to return a substitute value when the specified value is NULL. In PySpark, NVL can also be implemented using coalesce().

Usage:

df = df.withColumn("NvlResult", F.coalesce(F.col("b"), F.lit(0)))

df.show()

4. NVL2

NVL2 returns the second argument if the first argument is not NULL, otherwise, it returns the third argument. This can be implemented in PySpark using the when function.

Usage:

df = df.withColumn("Nvl2Result", F.when(F.col("a").isNotNull(), F.col("b")).otherwise(F.lit("Value is NULL")))

df.show()

Complete Example

Here’s a complete example that brings everything together:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create Spark session
spark = SparkSession.builder.appName("NullHandlingExample").getOrCreate()

# Sample data
data = [(1, None), (2, 2), (3, None), (4, 4)]
columns = ["a", "b"]
df = spark.createDataFrame(data, columns)

# NULLIF equivalent
df = df.withColumn("NullIfResult", F.when(F.col("a") == F.col("b"), None).otherwise(F.col("a")))

# IFNULL equivalent
df = df.withColumn("IfNullResult", F.coalesce(F.col("a"), F.lit(0)))

# NVL equivalent
df = df.withColumn("NvlResult", F.coalesce(F.col("b"), F.lit(0)))

# NVL2 equivalent
df = df.withColumn("Nvl2Result", F.when(F.col("a").isNotNull(), F.col("b")).otherwise(F.lit("Value is NULL")))

# Display the result
df.show()

Expected Output

The output DataFrame will include the original columns and the results of the NULLIF, IFNULL, NVL, and NVL2 operations, showing how each function handles NULL values.

+---+----+-------------+-------------+---------+------------------------+
| a| b| NullIfResult| IfNullResult| NvlResult| Nvl2Result |
+---+----+-------------+-------------+---------+------------------------+
| 1|null| 1| 1| 0| Value is NULL|
| 2| 2| 2| 2| 2| 2|
| 3|null| 3| 3| 0| Value is NULL|
| 4| 4| 4| 4| 4| 4|
+---+----+-------------+-------------+---------+------------------------+

Summary

  • NULLIF: Returns NULL if two values are equal.
  • IFNULL / NVL: Returns the second value if the first is NULL, otherwise returns the first value.
  • NVL2: Returns the second value if the first is not NULL; otherwise, it returns the third value.

These functions are useful for data cleaning and handling NULL values in your DataFrame.