In this post, we’ll cover some common Spark and SQL interview-style questions (asked at Nike) with practical examples. Whether you’re preparing for interviews or brushing up on PySpark/SQL concepts, this guide will help.
🔹 Spark 3.x vs Spark 2.x — Key Differences
| Feature | Spark 2.x | Spark 3.x |
|---|---|---|
| Adaptive Query Execution (AQE) | ❌ Not available | ✅ Available — dynamically optimizes queries at runtime |
| Dynamic Partition Pruning | ❌ | ✅ Improves join performance with partitioned data |
| ANSI SQL Compliance | Limited | Improved compliance |
| Python Support | Python 2.7 and 3.x | Python 3 only (Python 2 support dropped) |
| GPU Acceleration | ❌ | ✅ Via RAPIDS plugin |
| Performance | Slower on complex queries | Faster due to AQE, DPP, and optimizations |
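Both AQE and Dynamic Partition Pruning are controlled through Spark configuration. A minimal sketch of enabling them explicitly on a Spark 3.x session (both are on by default in recent releases):
# Adaptive Query Execution: re-optimizes the physical plan at runtime using shuffle statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Dynamic Partition Pruning: skips partitions of the large table based on the join's filter
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")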
🔹 Spark Optimizations vs Delta Lake Optimizations
| Aspect | Spark Optimizations | Delta Lake Optimizations |
|---|---|---|
| Query Optimization | Catalyst + Tungsten | Same + Delta-specific caching |
| Execution | Adaptive Query Execution | Z-ordering, Data skipping |
| Data Handling | RDDs/DataFrames | ACID transactions, schema enforcement |
| Schema Evolution | Manual | ✅ Opt-in via mergeSchema, plus table versioning (time travel) |
Delta Lake Example — Schema Evolution:
# Append with mergeSchema so new columns in df are added to the table schema
(df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/delta/events"))
The mergeSchema option lets the table schema evolve as new columns appear, and every write creates a new table version that you can query with time travel.
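The Z-ordering and data-skipping optimizations listed above are applied with Delta's OPTIMIZE command. A minimal sketch against the same table path, clustering on a hypothetical event_type column:
# Compact small files and co-locate rows with similar event_type values (column name is illustrative)
spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (event_type)")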
🔹 How Autoloader Works in Databricks
- Incrementally processes new files arriving in cloud storage (S3, ADLS, GCS).
- Uses file notification mode (event-driven) or directory listing mode.
- Handles schema inference + evolution.
- Scales automatically with large volumes of data.
Example Code:
df = (spark.readStream
    .format("cloudFiles")                                # Auto Loader source
    .option("cloudFiles.format", "json")                 # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/schema")  # required for schema inference/evolution (path is illustrative)
    .load("/mnt/input"))
🔹 SQL: Position-wise Highest Salary (CTE)
1. Using CTE + Aggregation
WITH max_sal_per_position AS (
SELECT position, MAX(sal) AS max_salary
FROM emp
GROUP BY position
)
SELECT e.name, e.position, e.sal
FROM emp e
JOIN max_sal_per_position m
ON e.position = m.position
AND e.sal = m.max_salary;
✅ Returns highest-paid employee(s) per position.
2. Using CTE + DENSE_RANK
WITH ranked_emp AS (
SELECT name, position, sal,
DENSE_RANK() OVER (PARTITION BY position ORDER BY sal DESC) AS rn
FROM emp
)
SELECT name, position, sal
FROM ranked_emp
WHERE rn = 1;
✅ Returns all top earners (handles salary ties).
3. Using CTE + ROW_NUMBER
WITH ranked_emp AS (
SELECT name, position, sal,
ROW_NUMBER() OVER (PARTITION BY position ORDER BY sal DESC) AS rn
FROM emp
)
SELECT name, position, sal
FROM ranked_emp
WHERE rn = 1;
✅ Returns only one employee per position (even if ties exist).
🔹 PySpark Equivalent (DENSE_RANK)
from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

# Rank employees within each position by salary; ties share the same rank
windowSpec = Window.partitionBy("position").orderBy(col("sal").desc())

result = (emp_df
    .withColumn("rn", dense_rank().over(windowSpec))
    .filter("rn = 1"))
✅ Key Takeaways
- Spark 3.x is a big leap forward with AQE, DPP, GPU support, and ANSI SQL improvements.
- Delta Lake provides schema evolution, ACID transactions, and optimizations.
- Autoloader helps with scalable incremental ingestion.
- SQL CTEs and window functions (ROW_NUMBER, DENSE_RANK) are must-know interview topics.
👉 Save this as a cheat sheet before your next Spark/SQL interview!