In this post, we’ll cover some common Spark and SQL interview-style questions (asked at Nike) with practical examples. Whether you’re preparing for interviews or brushing up on PySpark/SQL concepts, this guide will help.
🔹 Spark 3.x vs Spark 2.x — Key Differences
| Feature | Spark 2.x | Spark 3.x |
|---|---|---|
| Adaptive Query Execution (AQE) | ❌ Not available | ✅ Available — dynamically optimizes queries at runtime |
| Dynamic Partition Pruning | ❌ | ✅ Improves join performance with partitioned data |
| ANSI SQL Compliance | Limited | Improved compliance |
| Python Support | Python 2.7 and 3.x | Python 3 only (Python 2 support dropped) |
| GPU Acceleration | ❌ | ✅ Via RAPIDS plugin |
| Performance | Slower on complex queries | Faster due to AQE, DPP, and optimizations |
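Both AQE and Dynamic Partition Pruning are controlled through Spark configuration. A minimal sketch of enabling them explicitly on a Spark 3.x session (both are on by default in recent releases):
# Adaptive Query Execution: re-optimizes the physical plan at runtime using shuffle statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Dynamic Partition Pruning: skips partitions of the large table based on the join's filter
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")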
🔹 Spark Optimizations vs Delta Lake Optimizations
| Aspect | Spark Optimizations | Delta Lake Optimizations |
|---|---|---|
| Query Optimization | Catalyst + Tungsten | Same + Delta-specific caching |
| Execution | Adaptive Query Execution | Z-ordering, Data skipping |
| Data Handling | RDDs/DataFrames | ACID transactions, schema enforcement |
| Schema Evolution | Manual | ✅ Opt-in via mergeSchema, plus table versioning (time travel) |
Delta Lake Example — Schema Evolution:
# Append with mergeSchema so new columns in df are added to the table schema
(df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/delta/events"))
The mergeSchema option lets the table schema evolve as new columns appear, and every write creates a new table version that you can query with time travel.
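The Z-ordering and data-skipping optimizations listed above are applied with Delta's OPTIMIZE command. A minimal sketch against the same table path, clustering on a hypothetical event_type column:
# Compact small files and co-locate rows with similar event_type values (column name is illustrative)
spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (event_type)")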
🔹 How Autoloader Works in Databricks
- Incrementally processes new files arriving in cloud storage (S3, ADLS, GCS).
- Uses file notification mode (event-driven) or directory listing mode.
- Handles schema inference + evolution.
- Scales automatically with large volumes of data.
Example Code:
df = (spark.readStream
    .format("cloudFiles")                                # Auto Loader source
    .option("cloudFiles.format", "json")                 # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/schema")  # required for schema inference/evolution (path is illustrative)
    .load("/mnt/input"))
🔹 SQL: Position-wise Highest Salary (CTE)
1. Using CTE + Aggregation
WITH max_sal_per_position AS (
SELECT position, MAX(sal) AS max_salary
FROM emp
GROUP BY position
)
SELECT e.name, e.position, e.sal
FROM emp e
JOIN max_sal_per_position m
ON e.position = m.position
AND e.sal = m.max_salary;
✅ Returns highest-paid employee(s) per position.
2. Using CTE + DENSE_RANK
WITH ranked_emp AS (
SELECT name, position, sal,
DENSE_RANK() OVER (PARTITION BY position ORDER BY sal DESC) AS rn
FROM emp
)
SELECT name, position, sal
FROM ranked_emp
WHERE rn = 1;
✅ Returns all top earners (handles salary ties).
3. Using CTE + ROW_NUMBER
WITH ranked_emp AS (
SELECT name, position, sal,
ROW_NUMBER() OVER (PARTITION BY position ORDER BY sal DESC) AS rn
FROM emp
)
SELECT name, position, sal
FROM ranked_emp
WHERE rn = 1;
✅ Returns only one employee per position (even if ties exist).
🔹 PySpark Equivalent (DENSE_RANK)
from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank

# Rank employees within each position by salary; ties share the same rank
windowSpec = Window.partitionBy("position").orderBy(col("sal").desc())

result = (emp_df
    .withColumn("rn", dense_rank().over(windowSpec))
    .filter("rn = 1"))
✅ Key Takeaways
- Spark 3.x is a big leap forward with AQE, DPP, GPU support, and ANSI SQL improvements.
- Delta Lake provides schema evolution, ACID transactions, and optimizations.
- Autoloader helps with scalable incremental ingestion.
- SQL CTEs and window functions (ROW_NUMBER, DENSE_RANK) are must-know interview topics.
👉 Save this as a cheat sheet before your next Spark/SQL interview!