Preparing for data engineering or data science interviews? This post walks through frequently asked questions on SQL, PySpark, Python, and AWS vs Spark clusters, with answers and explanations.

🔹 SQL Interview Questions

1. How do you find the maximum total earnings for employees?

WITH cte AS (
    SELECT MAX(salary * months) AS max_total_earning
    FROM employee
)
SELECT COUNT(*) AS employee_count, cte.max_total_earning
FROM employee
JOIN cte ON employee.salary * employee.months = cte.max_total_earning
GROUP BY cte.max_total_earning;

👉 This returns the maximum earnings and the number of employees who achieved it.

2. How to get month-wise total sales and the top customer for each month?

WITH sales_cte AS (
    SELECT customer_id, 
           DATE_FORMAT(order_date, '%Y-%m') AS month,
           SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
)
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY month ORDER BY total_amount DESC) AS rnk
    FROM sales_cte
) ranked
WHERE rnk = 1;

👉 Uses window functions to get the highest spender per month.

3. How do you handle duplicates using DENSE_RANK?

SELECT name
FROM (
    SELECT name,
           DENSE_RANK() OVER (PARTITION BY name ORDER BY id) AS rnk
    FROM employees
) t
WHERE rnk > 1;

👉 Identifies duplicate names by ranking them.

4. How do you find duplicates based on multiple columns (name, dob, email)?

SELECT name, dob, email, COUNT(*) AS scount
FROM emp
GROUP BY name, dob, email
HAVING COUNT(*) > 1;

👉 Returns only those rows where the same name, dob, and email combination appears more than once.

🔹 PySpark Interview Questions

1. Difference between RDDs and DataFrames?

  • RDD: Low-level, untyped distributed collection. Good for complex transformations.
  • DataFrame: High-level, optimized via Catalyst Optimizer and Tungsten.
    👉 Prefer DataFrames unless you need fine-grained control (a quick side-by-side sketch follows).
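
To make the contrast concrete, here is a minimal sketch (assuming an existing SparkSession named spark) that computes the same per-department count with both APIs; only the DataFrame version goes through the Catalyst Optimizer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("HR", 1), ("IT", 2), ("HR", 3)]

# RDD API: low-level, you spell out the shuffle yourself
rdd_counts = (spark.sparkContext.parallelize(data)
              .map(lambda row: (row[0], 1))
              .reduceByKey(lambda a, b: a + b)
              .collect())

# DataFrame API: declarative, optimized by Catalyst/Tungsten
df_counts = (spark.createDataFrame(data, ["dept", "emp_id"])
             .groupBy("dept")
             .count()
             .collect())

print(rdd_counts)   # e.g. [('HR', 2), ('IT', 1)]
print(df_counts)    # [Row(dept='HR', count=2), Row(dept='IT', count=1)]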

2. How do you implement SCD Type 2 in PySpark?

from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit

tgt = DeltaTable.forName(spark, "employees")
src = spark.table("staging_employees")

(tgt.alias("tgt")
 .merge(src.alias("src"), "tgt.emp_id = src.emp_id AND tgt.is_active = true")
 .whenMatchedUpdate(
     condition="tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept",
     set={"is_active": lit(False), "end_date": current_date()}
 )
 .whenNotMatchedInsert(values={
     "emp_id": "src.emp_id",
     "emp_name": "src.emp_name",
     "dept": "src.dept",
     "start_date": current_date(),
     "end_date": lit(None),
     "is_active": lit(True)
 })
 .execute())

👉 This keeps historical versions of employee records.
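
One subtlety worth calling out in interviews: in the MERGE above, a changed employee still matches its current active row, so WHEN NOT MATCHED never fires for it and the new version is not inserted by that single statement. A common workaround, sketched below under the same assumed table and column names (not the only way to do it), is to stage the source twice: once keyed normally and once with a NULL merge key so the changed rows fall through to the insert branch.

from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, expr, lit

tgt = DeltaTable.forName(spark, "employees")
src = spark.table("staging_employees")

# Source rows whose tracked attributes changed: these need a brand-new active version
changed = (
    src.alias("src")
    .join(
        tgt.toDF().alias("tgt"),
        expr("src.emp_id = tgt.emp_id AND tgt.is_active = true "
             "AND (tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept)"),
        "inner",
    )
    .selectExpr("CAST(NULL AS INT) AS merge_key", "src.*")  # adjust the cast to emp_id's type
)

# All source rows keyed normally, plus the changed rows again with a NULL key
staged = src.selectExpr("emp_id AS merge_key", "*").unionByName(changed)

(tgt.alias("tgt")
 .merge(staged.alias("src"), "tgt.emp_id = src.merge_key AND tgt.is_active = true")
 .whenMatchedUpdate(                      # close the old version
     condition="tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept",
     set={"is_active": lit(False), "end_date": current_date()}
 )
 .whenNotMatchedInsert(values={           # brand-new employees AND new versions of changed ones
     "emp_id": "src.emp_id",
     "emp_name": "src.emp_name",
     "dept": "src.dept",
     "start_date": current_date(),
     "end_date": lit(None).cast("date"),
     "is_active": lit(True)
 })
 .execute())

The NULL-keyed copies can never match an existing row, so they are always inserted as fresh active versions, while the normally keyed copies close the old ones.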

Implementing Slowly Changing Dimension (SCD) Type 2 in Spark SQL

📌 What is SCD Type 2?

In Data Warehousing, Slowly Changing Dimensions (SCD) handle changes in dimension data over time.

  • SCD Type 1: Overwrites old data with new values (no history).
  • SCD Type 2: Maintains historical versions of a record by adding new rows with validity dates.
  • SCD Type 3: Stores limited history (previous value in a new column).

In SCD Type 2, we track changes using fields like:

  • start_date → when the record became active
  • end_date → when the record was closed (NULL = current active)
  • is_active → flag to indicate the current version (a minimal table sketch follows this list)
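
For concreteness, here is a minimal sketch of a dimension table carrying these fields, created from PySpark; the table and column names simply follow the example below.

# Hypothetical target dimension table (t1) with SCD Type 2 tracking fields
spark.sql("""
    CREATE TABLE IF NOT EXISTS t1 (
        emp_id     INT,
        emp_name   STRING,
        dept       STRING,
        start_date DATE,      -- when this version became active
        end_date   DATE,      -- NULL while the version is current
        is_active  BOOLEAN    -- flag for the current version
    ) USING DELTA
""")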

⚡ Example: SCD Type 2 in Spark SQL

Suppose we have two Delta tables:

  • t1 → target dimension table (history + active records)
  • t2 → source table (latest snapshot of employees)

We can write a MERGE statement like this:

MERGE INTO t1 AS tgt
USING t2 AS src
ON tgt.emp_id = src.emp_id AND tgt.is_active = true   -- match only active record
WHEN MATCHED AND (
      tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept  -- detect changes
) THEN UPDATE SET
    tgt.is_active = false,
    tgt.end_date = current_date   -- close old record

WHEN NOT MATCHED THEN
  INSERT (emp_id, emp_name, dept, start_date, end_date, is_active)
  VALUES (src.emp_id, src.emp_name, src.dept, current_date, NULL, true);

✅ Explanation

  1. Matching Active Records: We only match against records where is_active = true.
  2. Detecting Changes: If emp_name or dept has changed, we close the old record by setting:
    • is_active = false
    • end_date = current_date
  3. Inserting New Record: If no active record exists for that employee, we insert a new row with:
    • start_date = current_date
    • end_date = NULL
    • is_active = true

Note: a changed employee still matches its active row, so this single MERGE only closes the old version; the new version is typically inserted in a second pass (for example by staging the source with a NULL merge key, as sketched in the PySpark section above).

📊 Example Data Flow

Before update (t1):

emp_id | emp_name | dept | start_date | end_date | is_active
101    | John     | HR   | 2023-01-01 | NULL     | true

Incoming snapshot (t2):

emp_id | emp_name | dept
101    | John     | IT

After MERGE (t1):

emp_id | emp_name | dept | start_date | end_date   | is_active
101    | John     | HR   | 2023-01-01 | 2025-09-19 | false
101    | John     | IT   | 2025-09-19 | NULL       | true

🚀 Key Takeaways

  • SCD Type 2 allows tracking historical changes in dimension data.
  • Spark SQL with Delta Lake’s MERGE INTO is ideal for handling SCD2 efficiently.
  • Always include validity dates and active flags to simplify queries.

SCD Type 2 with DLT

Here’s a DLT SCD Type 2 pipeline that tracks exact change timestamps using a last_modified or event_time column from the source data instead of just the pipeline run date. This is more accurate for historical auditing.

🟢 DLT SCD Type 2 with Timestamp-Based History

import dlt
from pyspark.sql.functions import col, current_timestamp

# ----------------------------
# 1️⃣ Staging Table (Raw Source)
# ----------------------------
@dlt.table(
    comment="Staging employees with event timestamp"
)
# 🔹 Data quality checks are declared as expectation decorators
@dlt.expect("emp_id_not_null", "emp_id IS NOT NULL")
@dlt.expect("salary_positive", "salary > 0")
def staging_employees():
    # apply_changes expects a streaming source, so read incrementally with Auto Loader
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/raw/employees/")  # Replace with your S3 path
    )

    # Assume the source has an 'event_time' column indicating the change timestamp;
    # if not, fall back to the ingestion timestamp
    if "event_time" not in df.columns:
        df = df.withColumn("event_time", current_timestamp())

    return df

# --------------------------------------
# 2️⃣ Dimension Table (SCD Type 2 target)
# --------------------------------------
# The target of apply_changes is a streaming table; for SCD Type 2,
# DLT adds the __START_AT / __END_AT history columns automatically.
dlt.create_streaming_table(
    name="dim_employees",
    comment="SCD Type 2 Employee Dimension with timestamp-based history",
    table_properties={"delta.enableChangeDataFeed": "true"}
)

# -------------------------------
# 3️⃣ Apply SCD Type 2 Merge
# -------------------------------
dlt.apply_changes(
    target="dim_employees",                                # Target dimension table
    source="staging_employees",                            # Source table
    keys=["emp_id"],                                       # Business key
    sequence_by=col("event_time"),                         # Order changes by event_time
    stored_as_scd_type=2,                                  # Enable SCD Type 2 behavior
    track_history_column_list=["name", "dept", "salary"]   # Changes here open a new version
)

🔹 How This Differs from Pipeline-Date SCD2

  1. Event-based history
    • Uses event_time or last_modified from source.
    • Records show the actual change time, not just pipeline run time.
  2. Better accuracy
    • Multiple changes in a single day are tracked correctly.
    • Historical reports reflect the real change order.
  3. End time & current version
    • DLT stores each version’s validity in the generated __START_AT / __END_AT columns.
    • __END_AT is set automatically when a new version is inserted; the current version is the row where __END_AT is NULL (a point-in-time query sketch follows this list).
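
Once history carries real event timestamps, a point-in-time ("as of") lookup becomes a simple range predicate on the generated columns. A minimal sketch, assuming the pipeline publishes the table under a hypothetical my_schema.dim_employees name:

# What did employee 2 look like on 2025-09-25 (between its two versions)?
spark.sql("""
    SELECT emp_id, name, dept, salary
    FROM my_schema.dim_employees                          -- hypothetical published location
    WHERE emp_id = 2
      AND __START_AT <= TIMESTAMP '2025-09-25 00:00:00'
      AND (__END_AT > TIMESTAMP '2025-09-25 00:00:00' OR __END_AT IS NULL)
""").show()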

🔹 Example Output

emp_id | name | dept    | salary | __START_AT       | __END_AT
1      | John | HR      | 5000   | 2025-09-22 08:00 | NULL
2      | Mary | Finance | 6000   | 2025-09-22 09:15 | 2025-10-01 12:30
2      | Mary | Finance | 7000   | 2025-10-01 12:30 | NULL

(A row is the current version when its __END_AT is NULL.)

Notes

  • sequence_by=col("event_time") ensures changes are applied in the correct chronological order, even when several arrive in one pipeline run.
  • With stored_as_scd_type=2, DLT records each version’s validity window in the generated __START_AT / __END_AT columns, which take their values from event_time rather than the pipeline execution time.
  • This works best if your source system provides event timestamps. Otherwise, current_timestamp() acts as a fallback.

When dlt.apply_changes runs:

  • It detects a change in the tracked columns (salary changed).
  • The old record is closed automatically:
    • __END_AT = timestamp of the new change (2025-10-01 12:30).
  • The new record is inserted:
    • __START_AT = timestamp of the new change
    • __END_AT = NULL, marking it as the current version (see the query sketch below).
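
A quick sketch of reading only the current versions back out (again assuming a hypothetical my_schema.dim_employees location):

current = (
    spark.table("my_schema.dim_employees")   # hypothetical published location
    .where("__END_AT IS NULL")               # open-ended rows are the active versions
)
current.show()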

3. Example of Pivot/Unpivot in PySpark

from pyspark.sql import functions as F

# Pivot: one column per category
pivoted = sales.groupBy("month").pivot("category").agg(F.sum("amount"))

# Unpivot (melt) the pivoted result back to long format with stack()
unpivoted = pivoted.selectExpr("month", "stack(3, 'A', A, 'B', B, 'C', C) as (category, amount)")
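
A quick end-to-end run with made-up sample data (column names match the snippet above) shows both shapes:

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("2025-01", "A", 100), ("2025-01", "B", 50),
     ("2025-02", "A", 70), ("2025-02", "C", 30)],
    ["month", "category", "amount"],
)

pivoted = sales.groupBy("month").pivot("category").agg(F.sum("amount"))
pivoted.show()   # one column per category; NULL where a month had no sales in that category

unpivoted = pivoted.selectExpr(
    "month", "stack(3, 'A', A, 'B', B, 'C', C) as (category, amount)"
)
unpivoted.show() # back to long format: month, category, amount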

🔹 Python Interview Questions

1. How do you mask an email address?

def mask_email(email: str) -> str:
    username, domain = email.split("@")
    if len(username) > 2:
        masked_username = username[0] + "*" * (len(username) - 2) + username[-1]
    else:
        masked_username = username[0] + "*"
    return masked_username + "@" + domain

# Example
print(mask_email("johndoe@example.com"))  # Output: j*****e@example.com

👉 Masks the email by keeping the first and last characters of the username visible, replacing the rest with *.

🔹 AWS Cluster vs Spark Cluster

1. What is an AWS Cluster?

  • AWS cluster generally refers to a group of EC2 instances or services (like EMR, Redshift, ECS, or EKS) working together.
  • Example: Amazon EMR cluster manages big data workloads (a boto3 launch sketch follows this list).
  • Example: Redshift cluster for data warehousing.
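
As a hedged illustration (region, release label, instance types, and IAM role names are placeholders, not a recommended configuration), an EMR cluster with Spark installed can be launched from Python with boto3:

import boto3

emr = boto3.client("emr", region_name="us-east-1")        # placeholder region

response = emr.run_job_flow(
    Name="demo-spark-cluster",                            # placeholder cluster name
    ReleaseLabel="emr-7.1.0",                             # placeholder EMR release
    Applications=[{"Name": "Spark"}],                     # install Spark on the cluster
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",                     # default EMR instance profile
    ServiceRole="EMR_DefaultRole",                         # default EMR service role
)
print(response["JobFlowId"])                               # cluster id, e.g. j-XXXXXXXX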

2. What is a Spark Cluster?

  • Spark cluster = distributed Spark runtime with:
    • Driver → coordinates tasks.
    • Executors → perform computation.
  • Can run standalone, on YARN, Mesos, Kubernetes, or EMR (see the sketch below).
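
A minimal sketch of the idea (the master URLs are placeholders): the application code stays the same and only the cluster manager it points at changes.

from pyspark.sql import SparkSession

# Local / standalone / YARN / Kubernetes differ only in the master URL
spark = (SparkSession.builder
         .appName("cluster-demo")
         .master("spark://standalone-master:7077")    # placeholder standalone master
         # .master("yarn")                             # on a YARN cluster (e.g. EMR)
         # .master("k8s://https://k8s-apiserver:443")  # on Kubernetes
         .getOrCreate())

print(spark.sparkContext.master)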

3. AWS Cluster vs Spark Cluster Differences

Feature     | AWS Cluster (e.g., EMR)              | Spark Cluster
Definition  | Group of AWS compute nodes/services  | Runtime for Spark distributed jobs
Management  | Managed by AWS (EMR, Redshift)       | Managed by Spark master node
Use Case    | Data warehouse, ETL, ML workloads    | Distributed processing of big data
Flexibility | Limited to AWS ecosystem             | Runs anywhere (on-prem/cloud/hybrid)

👉 In short: AWS cluster is infrastructure; Spark cluster is execution runtime.

🔹 Final Tips for Interviews

  1. SQL → Practice CTEs, window functions, aggregations.
  2. PySpark → Focus on DataFrame API, Delta Lake MERGE, window functions.
  3. Python → Be ready with string manipulations, masking, data cleaning.
  4. AWS vs Spark → Be clear on infra (AWS services) vs execution (Spark runtime).

✅ With these examples, you’ll be ready to tackle common SQL, PySpark, Python, and architecture interview questions with confidence!