In a Mphasis interview, two PySpark questions were asked: how to skip the first two rows of a DataFrame and how to count the nulls in each column.

    PySpark Interview Questions: Skip the First 2 Rows and Count the Nulls of Each Column

    1. Skipping the first two rows: PySpark

    In PySpark, to skip the header row of a CSV file you can use option("header", "true").

    However, there is no built-in option to skip additional lines beyond the header.

    To skip the first two data rows, you can remove them after reading the CSV file, for example by pairing each row with its index and filtering out the first two.

    from pyspark.sql import SparkSession
    
    # Create Spark session
    spark = SparkSession.builder \
        .appName("Read CSV with Skip Rows") \
        .getOrCreate()
    
    # Define CSV path
    csv_path = "path/to/your/file.csv"
    
    # Read CSV file into DataFrame
    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv(csv_path)
    
    # Drop the first two rows: pair each row with its index,
    # keep only rows with index >= 2, and rebuild the DataFrame
    df = spark.createDataFrame(
        df.rdd.zipWithIndex()
              .filter(lambda pair: pair[1] >= 2)
              .map(lambda pair: pair[0]),
        schema=df.schema
    )
    
    # Show DataFrame
    df.show()
    

    Here, zipWithIndex() pairs each row with its position, the filter keeps only rows with an index of 2 or greater, and createDataFrame() rebuilds the DataFrame with the original schema.
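
    If your Spark version supports it (DataFrame.offset() was introduced in Spark 3.4), there is a more concise alternative. The snippet below is a minimal sketch, assuming the rows come from a single CSV source and arrive in file order; the file path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("Skip Rows with offset") \
        .getOrCreate()

    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("path/to/your/file.csv")

    # Skip the first two data rows (Spark 3.4+ only)
    df_skipped = df.offset(2)

    df_skipped.show()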

    2. Counting NULLs of each column: PySpark

    In PySpark, you can count the number of null values in each column of a DataFrame using the isNull() method combined with a list comprehension to iterate over all columns.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    
    # Create Spark session
    spark = SparkSession.builder \
        .appName("Count Null Values Example") \
        .getOrCreate()
    
    # Create a sample DataFrame
    data = [(1, "A", None),
            (2, "B", 30),
            (3, "C", None),
            (4, None, 40),
            (5, "E", 50)]
    
    columns = ["ID", "Name", "Age"]
    
    df = spark.createDataFrame(data, columns)
    
    # Count null values for each column
    null_counts = [df.filter(col(column_name).isNull()).count() for column_name in df.columns]
    
    # Create a dictionary to store column names and their corresponding null counts
    null_counts_dict = dict(zip(df.columns, null_counts))
    
    # Display null counts
    for column, count in null_counts_dict.items():
        print(f"{column}: {count} null values")
    

    Output

    ID: 0 null values
    Name: 1 null values
    Age: 2 null values
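
    Note that the list-comprehension approach above triggers a separate Spark job for each column. As a sketch of a more efficient variant, the nulls of all columns can be counted in a single pass with count() and when(), reusing the df created above (null_counts_df is just an illustrative name):

    from pyspark.sql.functions import col, count, when

    # Count nulls for every column in a single aggregation (one Spark job)
    null_counts_df = df.select(
        [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
    )

    null_counts_df.show()
    # +---+----+---+
    # | ID|Name|Age|
    # +---+----+---+
    # |  0|   1|  2|
    # +---+----+---+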

    Conclusion

    The PySpark interview questions shared here focus on two key areas: skipping the first two rows of a DataFrame and counting the null values of each column.

    In the first scenario, PySpark has no built-in option to skip lines beyond the header. However, the first two rows can still be removed after reading the CSV file, for example by pairing each row with its index using zipWithIndex() and filtering out the rows with index 0 and 1.

    For the second question, the isNull() method is combined with a list comprehension that iterates over all columns to count the null values in each one. The results are then zipped into a dictionary of column names and their null counts, giving a clear view of how the nulls are distributed.

    Overall, these questions provide valuable insights into handling data preprocessing and management in PySpark, addressing common challenges encountered in data manipulation within the PySpark environment.