Preparing for data engineering or data science interviews? Below is a structured collection of frequently asked questions around SQL, PySpark, Python, and AWS vs Spark clusters, with answers and explanations.
🔹 SQL Interview Questions
1. How do you find the maximum total earnings (salary × months) and the number of employees who earn it?
WITH cte AS (
    SELECT MAX(salary * months) AS max_total_earning
    FROM employee
)
SELECT COUNT(*) AS employee_count, cte.max_total_earning
FROM employee
CROSS JOIN cte
WHERE salary * months = cte.max_total_earning
GROUP BY cte.max_total_earning;
👉 This returns the maximum earnings and the number of employees who achieved it.
2. How to get month-wise total sales and the top customer for each month?
WITH sales_cte AS (
    SELECT customer_id,
           DATE_FORMAT(order_date, '%Y-%m') AS month,
           SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
)
SELECT *
FROM (
    SELECT *,
           SUM(total_amount) OVER (PARTITION BY month) AS month_total_sales,
           RANK() OVER (PARTITION BY month ORDER BY total_amount DESC) AS rnk
    FROM sales_cte
) ranked
WHERE rnk = 1;
👉 Uses window functions to compute each month's total sales and return the highest spender per month.
3. How do you handle duplicates using DENSE_RANK?
SELECT name
FROM (
    SELECT name,
           DENSE_RANK() OVER (PARTITION BY name ORDER BY id) AS rnk
    FROM employees
) t
WHERE rnk > 1;
👉 Identifies duplicate names by ranking them.
4. How do you find duplicates based on multiple columns (name, dob, email)?
SELECT name, dob, email, COUNT(*) AS scount
FROM emp
GROUP BY name, dob, email
HAVING COUNT(*) > 1;
👉 Returns only those rows where the same name, dob, and email combination appears more than once.
🔹 PySpark Interview Questions
1. Difference between RDDs and DataFrames?
- RDD: Low-level distributed collection with no schema and no automatic optimization. Offers fine-grained control for complex transformations.
- DataFrame: High-level, schema-aware API optimized by the Catalyst optimizer and the Tungsten execution engine.
👉 Prefer DataFrames unless you need fine-grained control.
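For illustration, here is a minimal sketch (the column names and data are made up) contrasting the two APIs on the same aggregation:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

data = [("HR", 5000), ("IT", 7000), ("HR", 6000)]

# RDD: schema-less tuples; the aggregation logic is written by hand
rdd = spark.sparkContext.parallelize(data)
totals_rdd = rdd.reduceByKey(lambda a, b: a + b).collect()

# DataFrame: named columns and a logical plan that Catalyst can optimize
df = spark.createDataFrame(data, ["dept", "salary"])
df.groupBy("dept").sum("salary").show()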
2. How do you implement SCD Type 2 in PySpark?
from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit

tgt = DeltaTable.forName(spark, "employees")
src = spark.table("staging_employees")

(tgt.alias("tgt")
    .merge(src.alias("src"), "tgt.emp_id = src.emp_id AND tgt.is_active = true")
    .whenMatchedUpdate(
        condition="tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept",
        set={"is_active": lit(False), "end_date": current_date()}
    )
    .whenNotMatchedInsert(values={
        "emp_id": "src.emp_id",
        "emp_name": "src.emp_name",
        "dept": "src.dept",
        "start_date": current_date(),
        "end_date": lit(None),
        "is_active": lit(True)
    })
    .execute())
👉 This single merge closes the old version of a changed employee and inserts brand-new employees. To also insert the new current version of a changed employee, run a second insert pass or stage the source as a union, as sketched below.
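A common way to do it in one statement is the staged-source pattern (similar to the SCD Type 2 example in the Delta Lake docs): changed rows are added to the source a second time under a NULL merge key so they fall into the insert branch. A minimal sketch, assuming the same employees and staging_employees tables and an integer emp_id:
from delta.tables import DeltaTable
from pyspark.sql.functions import col, current_date, lit

tgt = DeltaTable.forName(spark, "employees")
src = spark.table("staging_employees")

# Changed employees appear a second time with a NULL merge key,
# so they miss the match condition and land in the INSERT branch.
changed = (
    src.alias("src")
    .join(
        tgt.toDF().alias("tgt"),
        (col("src.emp_id") == col("tgt.emp_id")) & col("tgt.is_active")
    )
    .where("tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept")
    .selectExpr("CAST(NULL AS INT) AS merge_key", "src.*")   # cast NULL to emp_id's type (INT assumed)
)
staged = src.selectExpr("emp_id AS merge_key", "*").unionByName(changed)

(tgt.alias("tgt")
    .merge(staged.alias("src"), "tgt.emp_id = src.merge_key AND tgt.is_active = true")
    .whenMatchedUpdate(
        condition="tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept",
        set={"is_active": lit(False), "end_date": current_date()}   # close the old version
    )
    .whenNotMatchedInsert(values={                                  # new employees + new versions
        "emp_id": "src.emp_id",
        "emp_name": "src.emp_name",
        "dept": "src.dept",
        "start_date": "current_date()",
        "end_date": "CAST(NULL AS DATE)",
        "is_active": "true"
    })
    .execute())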
Implementing Slowly Changing Dimension (SCD) Type 2 in Spark SQL
📌 What is SCD Type 2?
In Data Warehousing, Slowly Changing Dimensions (SCD) handle changes in dimension data over time.
- SCD Type 1: Overwrites old data with new values (no history).
- SCD Type 2: Maintains historical versions of a record by adding new rows with validity dates.
- SCD Type 3: Stores limited history (previous value in a new column).
In SCD Type 2, we track changes using fields like:
- start_date → when the record became active
- end_date → when the record was closed (NULL = currently active)
- is_active → flag indicating the current version
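For concreteness, a dimension table carrying these fields could be declared like this (a sketch; the table and column names match the t1 example used below):
spark.sql("""
    CREATE TABLE IF NOT EXISTS t1 (
        emp_id     INT,
        emp_name   STRING,
        dept       STRING,
        start_date DATE,
        end_date   DATE,      -- NULL means this row is the current version
        is_active  BOOLEAN    -- convenience flag for the current version
    ) USING DELTA
""")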
⚡ Example: SCD Type 2 in Spark SQL
Suppose we have two Delta tables:
- t1 → target dimension table (history + active records)
- t2 → source table (latest snapshot of employees)
We can write a MERGE statement like this:
MERGE INTO t1 AS tgt
USING t2 AS src
ON tgt.emp_id = src.emp_id AND tgt.is_active = true         -- match only the active record
WHEN MATCHED AND (
    tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept    -- detect changes
) THEN UPDATE SET
    tgt.is_active = false,
    tgt.end_date = current_date                              -- close the old record
WHEN NOT MATCHED THEN
    INSERT (emp_id, emp_name, dept, start_date, end_date, is_active)
    VALUES (src.emp_id, src.emp_name, src.dept, current_date, NULL, true);
✅ Explanation
- Matching active records: we only match against rows where is_active = true.
- Detecting changes: if emp_name or dept has changed, we close the old record by setting is_active = false and end_date = current_date.
- Inserting new records: if no active record exists for an employee, we insert a new row with start_date = current_date, end_date = NULL, and is_active = true.
- As with the PySpark version, this single MERGE does not insert the new current row for a changed employee; that needs a second insert pass or a staged source (see the staged-source sketch above).
📊 Example Data Flow
Before update (t1):
| emp_id | emp_name | dept | start_date | end_date | is_active |
|---|---|---|---|---|---|
| 101 | John | HR | 2023-01-01 | NULL | true |
Incoming snapshot (t2):
| emp_id | emp_name | dept |
|---|---|---|
| 101 | John | IT |
After the SCD2 update (t1):
| emp_id | emp_name | dept | start_date | end_date | is_active |
|---|---|---|---|---|---|
| 101 | John | HR | 2023-01-01 | 2025-09-19 | false |
| 101 | John | IT | 2025-09-19 | NULL | true |
🚀 Key Takeaways
- SCD Type 2 allows tracking historical changes in dimension data.
- Spark SQL with Delta Lake's MERGE INTO is well suited to handling SCD2 efficiently.
- Always include validity dates and an active flag to simplify queries.
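As an illustration of that last point, two typical lookups against the t1 table from the example (a sketch; names follow the example above):
# Current version of every employee: just filter on the active flag
current = spark.sql("SELECT * FROM t1 WHERE is_active = true")

# Point-in-time lookup: the version that was valid on a given date
as_of = "2024-06-30"
snapshot = spark.sql(f"""
    SELECT emp_id, emp_name, dept
    FROM t1
    WHERE start_date <= DATE'{as_of}'
      AND (end_date IS NULL OR end_date > DATE'{as_of}')
""")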
SCD Type 2 with DLT
Here's a DLT (Delta Live Tables) SCD Type 2 pipeline that tracks exact change timestamps using a last_modified or event_time column from the source data instead of just the pipeline run date. This is more accurate for historical auditing.
🟢 DLT SCD Type 2 with Timestamp-Based History
import dlt
from pyspark.sql.functions import current_timestamp

# ----------------------------
# 1️⃣ Staging Table (Raw Source)
# ----------------------------
@dlt.table(
    comment="Staging employees with event timestamp"
)
# 🔹 Data quality checks (expectations are attached as decorators)
@dlt.expect("emp_id_not_null", "emp_id IS NOT NULL")
@dlt.expect("salary_positive", "salary > 0")
def staging_employees():
    # apply_changes needs a streaming source, so read with Auto Loader
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/raw/employees/")   # Replace with your S3 path
    )
    # Assume the source has an 'event_time' column indicating the change timestamp;
    # if not, fall back to the ingestion timestamp
    if "event_time" not in df.columns:
        df = df.withColumn("event_time", current_timestamp())
    return df

# --------------------------------------
# 2️⃣ Dimension Table (SCD Type 2 target)
# --------------------------------------
dlt.create_streaming_table(
    name="dim_employees",
    comment="SCD Type 2 Employee Dimension with timestamp-based history",
    table_properties={"delta.enableChangeDataFeed": "true"}
)

# -------------------------------
# 3️⃣ Apply SCD Type 2 Merge
# -------------------------------
dlt.apply_changes(
    target="dim_employees",                                # Target dimension table
    source="staging_employees",                            # Source table
    keys=["emp_id"],                                       # Business key
    sequence_by="event_time",                              # Order changes by event_time
    stored_as_scd_type=2,                                  # Enable SCD Type 2 behavior
    track_history_column_list=["name", "dept", "salary"]   # Columns whose changes open a new version
)
🔹 How This Differs from Pipeline-Date SCD2
- Event-based history: uses event_time (or last_modified) from the source, so records reflect the actual change time, not just the pipeline run time.
- Better accuracy: multiple changes in a single day are tracked correctly, and historical reports reflect the real change order.
- Validity columns: DLT adds __START_AT and __END_AT automatically; __END_AT is NULL for the latest version and is populated when a newer version arrives.
🔹 Example Output
| emp_id | name | dept | salary | __START_AT | __END_AT |
|---|---|---|---|---|---|
| 1 | John | HR | 5000 | 2025-09-22 08:00 | NULL |
| 2 | Mary | Finance | 6000 | 2025-09-22 09:15 | 2025-10-01 12:30 |
| 2 | Mary | Finance | 7000 | 2025-10-01 12:30 | NULL |
⚡ Notes
- sequence_by="event_time" ensures changes are applied in the correct chronological order; DLT uses that column's values for the __START_AT / __END_AT validity columns, so history reflects the actual change timestamp rather than pipeline execution time.
- This works best if your source system provides event timestamps. Otherwise, current_timestamp() acts as a fallback.
When dlt.apply_changes runs:
- It detects a change in one of the tracked columns (here, salary changed).
- The old record is closed automatically: __END_AT is set to the timestamp of the new change (2025-10-01 12:30).
- The new record is inserted with __START_AT = timestamp of the new change and __END_AT = NULL.
3. Example of Pivot/Unpivot in PySpark
from pyspark.sql import functions as F

# Pivot: one column per category
pivoted = sales.groupBy("month").pivot("category").agg(F.sum("amount"))

# Unpivot (melt) the pivoted result back to long format using stack()
unpivoted = pivoted.selectExpr(
    "month",
    "stack(3, 'A', A, 'B', B, 'C', C) AS (category, amount)"
)
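To try the snippet above end to end, here is a small made-up sales DataFrame with the assumed columns (month, category, amount) and categories A, B, C:
from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [
        ("2025-01", "A", 100), ("2025-01", "B", 250), ("2025-01", "C", 75),
        ("2025-02", "A", 120), ("2025-02", "B", 300), ("2025-02", "C", 90),
    ],
    ["month", "category", "amount"],
)

pivoted = sales.groupBy("month").pivot("category").agg(F.sum("amount"))
pivoted.show()   # one row per month with columns A, B, C (row order may vary)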
🔹 Python Interview Questions
1. How do you mask an email address?
def mask_email(email: str) -> str:
    username, domain = email.split("@")
    if len(username) > 2:
        masked_username = username[0] + "*" * (len(username) - 2) + username[-1]
    else:
        masked_username = username[0] + "*"
    return masked_username + "@" + domain

# Example
print(mask_email("johndoe@example.com"))  # Output: j*****e@example.com
👉 Masks the email by keeping the first and last characters of the username visible, replacing the rest with *.
🔹 AWS Cluster vs Spark Cluster
1. What is an AWS Cluster?
- An AWS cluster generally refers to a group of EC2 instances or managed services (like EMR, Redshift, ECS, or EKS) working together.
- Example: an Amazon EMR cluster manages big data workloads (a provisioning sketch follows below).
- Example: a Redshift cluster serves as a data warehouse.
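For instance, a transient Spark-enabled EMR cluster can be provisioned programmatically with boto3; this is only a sketch, and the bucket, region, roles, and instance sizes are placeholder assumptions:
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small Spark-enabled EMR cluster (placeholder names and sizes)
response = emr.run_job_flow(
    Name="interview-demo-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",
)
print(response["JobFlowId"])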
2. What is a Spark Cluster?
- Spark cluster = distributed Spark runtime with:
- Driver → coordinates tasks.
- Executors → perform computation.
- Can run standalone, or on YARN, Kubernetes, or EMR (Mesos support is deprecated in recent Spark releases), as the sketch below shows.
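The application code itself barely changes across these deployments; typically only the master and resource settings differ. A minimal sketch (URLs and sizes are placeholders):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")
    # Local development: driver and executors in a single JVM
    .master("local[4]")
    # Standalone cluster:  .master("spark://spark-master:7077")
    # YARN (e.g., on EMR): .master("yarn")
    # Kubernetes:          .master("k8s://https://<api-server>:6443")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
print(spark.version)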
3. AWS Cluster vs Spark Cluster Differences
| Feature | AWS Cluster (e.g., EMR) | Spark Cluster |
|---|---|---|
| Definition | Group of AWS compute nodes/services | Runtime for Spark distributed jobs |
| Management | Managed by AWS (EMR, Redshift) | Managed by Spark master node |
| Use Case | Data warehouse, ETL, ML workloads | Distributed processing of big data |
| Flexibility | Limited to AWS ecosystem | Runs anywhere (on-prem/cloud/hybrid) |
👉 In short: AWS cluster is infrastructure; Spark cluster is execution runtime.
🔹 Final Tips for Interviews
- SQL → Practice CTEs, window functions, aggregations.
- PySpark → Focus on the DataFrame API, Delta Lake MERGE, window functions.
- Python → Be ready with string manipulations, masking, data cleaning.
- AWS vs Spark → Be clear on infra (AWS services) vs execution (Spark runtime).
✅ With these examples, you’ll be ready to tackle common SQL, PySpark, Python, and architecture interview questions with confidence!






