In an Mphasis interview, two PySpark questions were asked: how to skip the first two rows of a DataFrame and how to count the null values in each column.

1. Skipping the first two rows: PySpark
In PySpark, to skip the header row you can use option("header", "true").
However, there's no built-in CSV option to skip additional rows beyond the header.
To skip the first two data rows, one approach is to read the CSV file and then filter those rows out by index using zipWithIndex on the underlying RDD, as shown below.
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
    .appName("Read CSV with Skip Rows") \
    .getOrCreate()
# Define CSV path
csv_path = "path/to/your/file.csv"
# Read CSV file into DataFrame
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(csv_path)
# Skip the first two rows by pairing each row with its index,
# keeping only rows with index 2 or higher
df = (df.rdd.zipWithIndex()
        .filter(lambda row_index: row_index[1] >= 2)
        .map(lambda row_index: row_index[0])
        .toDF(df.schema))
# Show DataFrame
df.show()
Here, zipWithIndex pairs each row with its position, the filter drops indexes 0 and 1, and map strips the index off again before converting back to a DataFrame with the original schema.
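Alternatively, the same result can be achieved without dropping down to the RDD API by subtracting the unwanted rows with exceptAll(). This is a minimal sketch, assuming Spark 2.4 or later (where exceptAll() was introduced) and that head(2) returns the two rows you actually want to remove; row order is only predictable here because the CSV is read from a single file.
# Re-read the raw file, reusing the Spark session created above
df_raw = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(csv_path)
# Wrap the first two rows in their own small DataFrame
first_two = spark.createDataFrame(df_raw.head(2), df_raw.schema)
# Remove exactly those two row occurrences
df_skipped = df_raw.exceptAll(first_two)
df_skipped.show()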
2. Counting NULLs of each column: PySpark
In PySpark, you can count the number of null values in each column of a DataFrame using the isNull() method combined with a list comprehension to iterate over all columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create Spark session
spark = SparkSession.builder \
    .appName("Count Null Values Example") \
    .getOrCreate()
# Create a sample DataFrame
data = [(1, "A", None),
        (2, "B", 30),
        (3, "C", None),
        (4, None, 40),
        (5, "E", 50)]
columns = ["ID", "Name", "Age"]
df = spark.createDataFrame(data, columns)
# Count null values for each column
null_counts = [df.filter(col(column_name).isNull()).count() for column_name in df.columns]
# Create a dictionary to store column names and their corresponding null counts
null_counts_dict = dict(zip(df.columns, null_counts))
# Display null counts
for column, count in null_counts_dict.items():
    print(f"{column}: {count} null values")
Output
ID: 0 null values
Name: 1 null values
Age: 2 null values
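Note that the list comprehension above runs a separate filter-and-count job for every column. For wide DataFrames, a common single-pass alternative is to aggregate all columns at once with count(when(...)); the sketch below reuses the sample df created earlier:
from pyspark.sql.functions import col, count, when

# One aggregation job that counts nulls for every column at once;
# count() ignores the nulls that when() produces for non-matching rows
null_counts_df = df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts_df.show()
# Expected output for the sample data:
# +---+----+---+
# | ID|Name|Age|
# +---+----+---+
# |  0|   1|  2|
# +---+----+---+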
Conclusion
The PySpark interview questions shared here focus on two key areas: skipping the first two rows of a DataFrame and counting the null values of each column.
In the first scenario, PySpark has no CSV option to skip additional lines beyond the header. However, the first two rows can still be skipped after reading the file, for example by indexing the rows with zipWithIndex and filtering out indexes 0 and 1.
For the second question, the isNull() method is combined with a list comprehension to count the null values in each column; zipping the column names with these counts into a dictionary gives a clear view of how nulls are distributed across the DataFrame.
Overall, these questions provide valuable insights into handling data preprocessing and management in PySpark, addressing common challenges encountered in data manipulation within the PySpark environment.