In this post, we’ll cover some common Spark and SQL interview-style questions (asked at Nike) with practical examples. Whether you’re preparing for interviews or brushing up on PySpark/SQL concepts, this guide will help.

🔹 Spark 3.x vs Spark 2.x — Key Differences

| Feature | Spark 2.x | Spark 3.x |
|---|---|---|
| Adaptive Query Execution (AQE) | ❌ Not available | ✅ Available — dynamically optimizes query plans at runtime |
| Dynamic Partition Pruning (DPP) | ❌ Not available | ✅ Improves join performance on partitioned data |
| ANSI SQL Compliance | Limited | Improved compliance (opt-in ANSI mode) |
| Python Support | Python 2 and older Python 3 | Python 3.6+ (Python 2 support dropped) |
| GPU Acceleration | ❌ Not available | ✅ Via the RAPIDS Accelerator plugin |
| Performance | Slower on complex queries | Faster thanks to AQE, DPP, and other optimizer improvements |
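In Spark 3.x, AQE and DPP are toggled via runtime configs (both are on by default in recent releases); a minimal sketch using the standard config keys:

# Enable Adaptive Query Execution and Dynamic Partition Pruning
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")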

🔹 Spark Optimizations vs Delta Lake Optimizations

| Aspect | Spark Optimizations | Delta Lake Optimizations |
|---|---|---|
| Query Optimization | Catalyst + Tungsten | Same, plus Delta-specific caching |
| Execution | Adaptive Query Execution | Z-ordering, data skipping |
| Data Handling | RDDs/DataFrames | ACID transactions, schema enforcement |
| Schema Evolution | Manual | ✅ Automatic schema evolution with version tracking |

Delta Lake Example — Schema Evolution:

# mergeSchema merges any new columns in df into the existing table schema
(df.write.format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save("/mnt/delta/events"))

With mergeSchema enabled, new columns in the incoming DataFrame are added to the table schema on write, and every write is recorded as a new table version.
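That version tracking can be inspected with Delta time travel — a minimal sketch, assuming the same table path (version 0 is just a placeholder):

from delta.tables import DeltaTable

# Show the table's commit history
DeltaTable.forPath(spark, "/mnt/delta/events").history() \
    .select("version", "timestamp", "operation").show()

# Read the table as it was at an earlier version
old_df = (spark.read.format("delta")
          .option("versionAsOf", 0)
          .load("/mnt/delta/events"))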

🔹 How Autoloader Works in Databricks

  • Incrementally processes new files arriving in cloud storage (S3, ADLS, GCS).
  • Uses file notification mode (event-driven) or directory listing mode.
  • Handles schema inference + evolution.
  • Scales automatically with large volumes of data.

Example Code:

df = (spark.readStream
    .format("cloudFiles")                                # Auto Loader source
    .option("cloudFiles.format", "json")                 # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/schema")  # required for schema inference/evolution; path is illustrative
    .load("/mnt/input"))

🔹 SQL: Position-wise Highest Salary (CTE)

1. Using CTE + Aggregation

WITH max_sal_per_position AS (
    SELECT position, MAX(sal) AS max_salary
    FROM emp
    GROUP BY position
)
SELECT e.name, e.position, e.sal
FROM emp e
JOIN max_sal_per_position m
  ON e.position = m.position
 AND e.sal = m.max_salary;

✅ Returns highest-paid employee(s) per position.

2. Using CTE + DENSE_RANK

WITH ranked_emp AS (
    SELECT name, position, sal,
           DENSE_RANK() OVER (PARTITION BY position ORDER BY sal DESC) AS rn
    FROM emp
)
SELECT name, position, sal
FROM ranked_emp
WHERE rn = 1;

✅ Returns all top earners (handles salary ties).

3. Using CTE + ROW_NUMBER

WITH ranked_emp AS (
    SELECT name, position, sal,
           ROW_NUMBER() OVER (PARTITION BY position ORDER BY sal DESC) AS rn
    FROM emp
)
SELECT name, position, sal
FROM ranked_emp
WHERE rn = 1;

✅ Returns exactly one employee per position (ties are broken arbitrarily unless you add a tiebreaker to the ORDER BY).

🔹 PySpark Equivalent (DENSE_RANK)

from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

# Rank employees within each position by salary, highest first
windowSpec = Window.partitionBy("position").orderBy(col("sal").desc())

result = (emp_df
    .withColumn("rn", dense_rank().over(windowSpec))
    .filter(col("rn") == 1))

✅ Key Takeaways

  • Spark 3.x is a big leap forward with AQE, DPP, GPU support, and ANSI SQL improvements.
  • Delta Lake provides schema evolution, ACID transactions, and optimizations.
  • Autoloader helps with scalable incremental ingestion.
  • SQL + CTE + window functions (ROW_NUMBER, DENSE_RANK) are must-know interview topics.

👉 Save this as a cheat sheet before your next Spark/SQL interview!