We have collected Databricks and PySpark questions asked at Hexware for the data engineer role.

SQL and PySpark Questions
01. How do you extract non-matching rows from two tables?
Write a sub-query to find the rows in each table that have no match in the other, then apply UNION to merge the two result sets.
select * from table1
where name not in (select name from table2)
union
select * from table2
where name not in (select name from table1);
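If the interviewer asks for the same result in PySpark, a left anti join on each side gives the equivalent of the NOT IN sub-queries. A minimal sketch, with hypothetical sample data standing in for table1 and table2:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NonMatchingRows").getOrCreate()

# Hypothetical stand-ins for table1 and table2 with a shared schema
table1_df = spark.createDataFrame([("John", 25), ("Alice", 30)], ["name", "age"])
table2_df = spark.createDataFrame([("Alice", 30), ("Bob", 35)], ["name", "age"])

# left_anti keeps only the rows of the left side with no match on "name"
only_in_table1 = table1_df.join(table2_df, on="name", how="left_anti")
only_in_table2 = table2_df.join(table1_df, on="name", how="left_anti")

# Merge the two non-matching sets (the schemas match, so unionByName is safe)
non_matching = only_in_table1.unionByName(only_in_table2)
non_matching.show()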
02. Is there a way to manually add versioning to Parquet files?
Here is an example that stamps a version number on the data each time a file is written.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Initialize Spark session
spark = SparkSession.builder \
    .appName("Delta Parquet Versioning Example") \
    .getOrCreate()
# Sample DataFrame
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])
# Generate version number
version = 1
# Add version column to DataFrame
df_with_version = df.withColumn("version", lit(version))
# Define output path
output_path = f"/path/to/output/delta_parquet/v{version}"
# Write DataFrame to delta Parquet files partitioned by version
df_with_version.write.partitionBy("version").format("delta").save(output_path)
# Stop Spark session
spark.stop()
- We create a sample DataFrame df and generate a version number, in this case version = 1.
- We add a new column called "version" to the DataFrame, with the same version number for all rows.
- We define the output path for the Delta Parquet files, incorporating the version number into the path (/path/to/output/delta_parquet/v1).
- We write the DataFrame with df_with_version.write.partitionBy("version").format("delta").save(output_path), which partitions the data by the version number.
- You can set the version variable to any integer to version your output Delta Parquet files accordingly. Each time you generate a new output, update the version number to keep track of the different versions of your data; a sketch of reading a versioned output back follows below.
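To consume a particular version later, you can load the matching versioned path directly, or, if the data were kept in a single Delta table instead of per-version paths, use Delta Lake's built-in time travel. A minimal sketch, assuming an active SparkSession and the output layout from the example above (the single-table path in the second read is hypothetical):
# Load one manually versioned output (path layout from the example above)
v1_df = spark.read.format("delta").load("/path/to/output/delta_parquet/v1")
v1_df.show()

# If the data were maintained as a single Delta table, Delta's time travel
# can retrieve an earlier snapshot without manual version paths
historic_df = (
    spark.read.format("delta")
    .option("versionAsOf", 0)              # first committed version
    .load("/path/to/output/delta_table")   # hypothetical single-table path
)
historic_df.show()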
03. Where can the resulting transformations be stored in PySpark?
In PySpark, you have several options for storing the results of your transformations:
- Parquet Files: Parquet is a columnar storage format that is highly efficient for analytics. You can store the results as Parquet files using the write.parquet() method:
df.write.parquet("/path/to/output/parquet")
- Delta Lake: Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It provides features like ACID transactions, scalable metadata handling, and data versioning. You can store the transformations as Delta tables using the Delta Lake format:
df.write.format("delta").save("/path/to/output/delta_table")
- CSV Files: If you prefer a human-readable format, you can store the DataFrame as CSV files using the write.csv() method:
df.write.csv("/path/to/output/csv")
- JSON Files: JSON is another choice for storing data, especially if you need a semi-structured format. You can save the DataFrame as JSON files using the write.json() method:
df.write.json("/path/to/output/json")
- Hive Tables: If you are working in a Hive-enabled environment, you can also save the DataFrame as a Hive table using the saveAsTable() method:
df.write.saveAsTable("database.table_name")
Choose the storage format based on your requirements for efficiency, ease of use, and compatibility with downstream processing systems. Each format has its own trade-offs in performance, storage efficiency, and data integrity.
04. What is Catalyst Optimization, and how does the developer implement it?
- Catalyst is the query optimizer at the core of Apache Spark's SQL engine; it is responsible for the efficient optimization and execution of SQL queries and DataFrame operations.
- Spark SQL first builds logical and physical query plans, and Catalyst applies a series of optimizations to these plans to improve performance.
- Catalyst performs rule-based transformations and optimizations on the query plans to generate an optimized execution plan.
- Developers do not implement Catalyst themselves; it runs automatically whenever the DataFrame, Dataset, or SQL APIs are used, and its plans can be inspected with explain(), as shown below.
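As a quick way to see Catalyst at work, explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan for a query. A minimal sketch, assuming an active SparkSession and hypothetical sample data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("CatalystExample").getOrCreate()

# Hypothetical sample data for illustration
df = spark.createDataFrame([("John", 25), ("Alice", 30)], ["Name", "Age"])

# Catalyst optimizes this query automatically; explain(True) prints the
# parsed, analyzed, and optimized logical plans plus the physical plan
df.filter(col("Age") > 26).select("Name").explain(True)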
05. How can I read a CSV without a schema and create a new schema in PySpark?
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Read the schemaless CSV; without an explicit schema every column is read as a string
csv_path = "/path/to/schemaless.csv"
raw_df = spark.read.option("header", "true").csv(csv_path)
raw_df.printSchema()

# Define the schema the data should conform to
new_schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", IntegerType(), True),
    StructField("column3", DoubleType(), True),
    # Add more fields as needed
])

# Cast each column to the type declared in the new schema
new_df = raw_df \
    .withColumn("column1", raw_df["column1"].cast(StringType())) \
    .withColumn("column2", raw_df["column2"].cast(IntegerType())) \
    .withColumn("column3", raw_df["column3"].cast(DoubleType()))
new_df.printSchema()
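Alternatively, the same schema can be supplied at read time so the columns are typed as they are loaded, avoiding the per-column casts. A minimal sketch reusing the new_schema and csv_path defined above:
# Apply the explicit schema while reading, so no casting step is needed
typed_df = (
    spark.read
    .option("header", "true")
    .schema(new_schema)
    .csv(csv_path)
)
typed_df.printSchema()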
06. What is AQE (Adaptive Query Execution)? Does it need developer intervention for implementation?
- Adaptive Query Execution (AQE) is a Spark SQL feature that re-optimizes query execution dynamically based on runtime statistics.
- It aims to improve the performance of Spark jobs by adjusting the execution plan at runtime, rather than relying solely on static optimization strategies.
- It does not need developer intervention beyond configuration: AQE is controlled by the spark.sql.adaptive.enabled setting (enabled by default since Spark 3.2), and Spark applies the runtime re-optimizations automatically, as shown in the sketch below.
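For reference, here is a minimal sketch of enabling AQE and two of its common sub-features through Spark configuration; the property names are standard Spark SQL settings, and the values shown are illustrative:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AQE Example") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .getOrCreate()

# With these settings, Spark coalesces shuffle partitions and handles skewed
# joins at runtime; no changes to query code are required.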
Conclusion
- In conclusion, mastering SQL and PySpark is important for a data engineer role, especially for Hexware interviews.
- Candidates should be prepared to tackle questions that range from basic data extraction to more complex topics like versioning, Catalyst Optimization, and Adaptive Query Execution.
- A solid grasp of these concepts and their practical applications will impress interviewers and increase the likelihood of success in the interview process.