In a Mphasis interview, two PySpark questions were asked: how to skip the first two rows of a DataFrame and how to count the nulls in each column.

    PySpark Interview Questions: Skip the First 2 Rows and Count the Nulls of Each Column

    1. Skipping the first two rows: PySpark

    In PySpark, to skip the header row of a CSV file you can use option("header", "true").

    However, there is no built-in option to skip additional lines beyond the header.

    To skip the first two data rows, you can remove them after reading the CSV file, for example by pairing each row with its index and filtering out the first two.

    from pyspark.sql import SparkSession
    
    # Create Spark session
    spark = SparkSession.builder \
        .appName("Read CSV with Skip Rows") \
        .getOrCreate()
    
    # Define CSV path
    csv_path = "path/to/your/file.csv"
    
    # Read CSV file into DataFrame
    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv(csv_path)
    
    # Drop the first two rows: pair each row with its index,
    # keep only rows with index >= 2, and rebuild the DataFrame
    df = spark.createDataFrame(
        df.rdd.zipWithIndex()
              .filter(lambda pair: pair[1] >= 2)
              .map(lambda pair: pair[0]),
        schema=df.schema
    )
    
    # Show DataFrame
    df.show()
    

    Here, zipWithIndex() pairs each row with its position, the filter keeps only rows with an index of 2 or greater, and createDataFrame() rebuilds the DataFrame with the original schema.
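
    If your Spark version supports it (DataFrame.offset() was introduced in Spark 3.4), there is a more concise alternative. The snippet below is a minimal sketch, assuming the rows come from a single CSV source and arrive in file order; the file path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("Skip Rows with offset") \
        .getOrCreate()

    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("path/to/your/file.csv")

    # Skip the first two data rows (Spark 3.4+ only)
    df_skipped = df.offset(2)

    df_skipped.show()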

    2. Counting NULLs of each column: PySpark

    In PySpark, you can count the number of null values in each column of a DataFrame using the isNull() method combined with a list comprehension to iterate over all columns.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    
    # Create Spark session
    spark = SparkSession.builder \
        .appName("Count Null Values Example") \
        .getOrCreate()
    
    # Create a sample DataFrame
    data = [(1, "A", None),
            (2, "B", 30),
            (3, "C", None),
            (4, None, 40),
            (5, "E", 50)]
    
    columns = ["ID", "Name", "Age"]
    
    df = spark.createDataFrame(data, columns)
    
    # Count null values for each column
    null_counts = [df.filter(col(column_name).isNull()).count() for column_name in df.columns]
    
    # Create a dictionary to store column names and their corresponding null counts
    null_counts_dict = dict(zip(df.columns, null_counts))
    
    # Display null counts
    for column, count in null_counts_dict.items():
        print(f"{column}: {count} null values")
    

    Output

    ID: 0 null values
    Name: 1 null values
    Age: 2 null values
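
    Note that the list-comprehension approach above triggers a separate Spark job for each column. As a sketch of a more efficient variant, the nulls of all columns can be counted in a single pass with count() and when(), reusing the df created above (null_counts_df is just an illustrative name):

    from pyspark.sql.functions import col, count, when

    # Count nulls for every column in a single aggregation (one Spark job)
    null_counts_df = df.select(
        [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
    )

    null_counts_df.show()
    # +---+----+---+
    # | ID|Name|Age|
    # +---+----+---+
    # |  0|   1|  2|
    # +---+----+---+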

    Conclusion

    The PySpark interview questions shared here focus on two key areas: skipping the first two rows of a DataFrame and counting the null values of each column.

    In the first scenario, PySpark has no built-in option to skip lines beyond the header. However, the first two rows can still be removed after reading the CSV file, for example by pairing each row with its index using zipWithIndex() and filtering out the rows with index 0 and 1.

    For the second question, the isNull() method is combined with a list comprehension that iterates over all columns to count the null values in each one. The results are then zipped into a dictionary of column names and their null counts, giving a clear view of how the nulls are distributed.

    Overall, these questions provide valuable insights into handling data preprocessing and management in PySpark, addressing common challenges encountered in data manipulation within the PySpark environment.