Preparing for data engineering or data science interviews? This post walks through frequently asked questions on SQL, PySpark, Python, and AWS vs Spark clusters, with answers and explanations.

🔹 SQL Interview Questions

1. How do you find the maximum total earnings for employees?

WITH cte AS (
    SELECT MAX(salary * months) AS max_total_earning
    FROM employee
)
SELECT COUNT(*) AS employee_count, cte.max_total_earning
FROM employee
JOIN cte ON employee.salary * employee.months = cte.max_total_earning
GROUP BY cte.max_total_earning;

👉 This returns the maximum earnings and the number of employees who achieved it.

2. How to get month-wise total sales and the top customer for each month?

WITH sales_cte AS (
    SELECT customer_id, 
           DATE_FORMAT(order_date, '%Y-%m') AS month,
           SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
)
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY month ORDER BY total_amount DESC) AS rnk
    FROM sales_cte
) ranked
WHERE rnk = 1;

👉 Uses window functions to get the highest spender per month.

3. How do you handle duplicates using DENSE_RANK?

SELECT name
FROM (
    SELECT name,
           DENSE_RANK() OVER (PARTITION BY name ORDER BY id) AS rnk
    FROM employees
) t
WHERE rnk > 1;

👉 Identifies duplicate names by ranking them.

4. How do you find duplicates based on multiple columns (name, dob, email)?

SELECT name, dob, email, COUNT(*) AS scount
FROM emp
GROUP BY name, dob, email
HAVING COUNT(*) > 1;

👉 Returns only those rows where the same name, dob, and email combination appears more than once.

🔹 PySpark Interview Questions

1. Difference between RDDs and DataFrames?

  • RDD: Low-level, untyped distributed collection. Good for complex transformations.
  • DataFrame: High-level, optimized via Catalyst Optimizer and Tungsten.
    👉 Prefer DataFrames unless you need fine-grained control (a quick side-by-side sketch follows).
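
To make the contrast concrete, here is a minimal sketch (assuming an existing SparkSession named spark) that computes the same per-department count with both APIs; only the DataFrame version goes through the Catalyst Optimizer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("HR", 1), ("IT", 2), ("HR", 3)]

# RDD API: low-level, you spell out the shuffle yourself
rdd_counts = (spark.sparkContext.parallelize(data)
              .map(lambda row: (row[0], 1))
              .reduceByKey(lambda a, b: a + b)
              .collect())

# DataFrame API: declarative, optimized by Catalyst/Tungsten
df_counts = (spark.createDataFrame(data, ["dept", "emp_id"])
             .groupBy("dept")
             .count()
             .collect())

print(rdd_counts)   # e.g. [('HR', 2), ('IT', 1)]
print(df_counts)    # [Row(dept='HR', count=2), Row(dept='IT', count=1)]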

2. How do you implement SCD Type 2 in PySpark?

from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit

tgt = DeltaTable.forName(spark, "employees")
src = spark.table("staging_employees")

(tgt.alias("tgt")
 .merge(src.alias("src"), "tgt.emp_id = src.emp_id AND tgt.is_active = true")
 .whenMatchedUpdate(
     condition="tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept",
     set={"is_active": lit(False), "end_date": current_date()}
 )
 .whenNotMatchedInsert(values={
     "emp_id": "src.emp_id",
     "emp_name": "src.emp_name",
     "dept": "src.dept",
     "start_date": current_date(),
     "end_date": lit(None),
     "is_active": lit(True)
 })
 .execute())

👉 This keeps historical versions of employee records.
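
One subtlety worth calling out in interviews: in the MERGE above, a changed employee still matches its current active row, so WHEN NOT MATCHED never fires for it and the new version is not inserted by that single statement. A common workaround, sketched below under the same assumed table and column names (not the only way to do it), is to stage the source twice: once keyed normally and once with a NULL merge key so the changed rows fall through to the insert branch.

from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, expr, lit

tgt = DeltaTable.forName(spark, "employees")
src = spark.table("staging_employees")

# Source rows whose tracked attributes changed: these need a brand-new active version
changed = (
    src.alias("src")
    .join(
        tgt.toDF().alias("tgt"),
        expr("src.emp_id = tgt.emp_id AND tgt.is_active = true "
             "AND (tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept)"),
        "inner",
    )
    .selectExpr("CAST(NULL AS INT) AS merge_key", "src.*")  # adjust the cast to emp_id's type
)

# All source rows keyed normally, plus the changed rows again with a NULL key
staged = src.selectExpr("emp_id AS merge_key", "*").unionByName(changed)

(tgt.alias("tgt")
 .merge(staged.alias("src"), "tgt.emp_id = src.merge_key AND tgt.is_active = true")
 .whenMatchedUpdate(                      # close the old version
     condition="tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept",
     set={"is_active": lit(False), "end_date": current_date()}
 )
 .whenNotMatchedInsert(values={           # brand-new employees AND new versions of changed ones
     "emp_id": "src.emp_id",
     "emp_name": "src.emp_name",
     "dept": "src.dept",
     "start_date": current_date(),
     "end_date": lit(None).cast("date"),
     "is_active": lit(True)
 })
 .execute())

The NULL-keyed copies can never match an existing row, so they are always inserted as fresh active versions, while the normally keyed copies close the old ones.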

Implementing Slowly Changing Dimension (SCD) Type 2 in Spark SQL

📌 What is SCD Type 2?

In Data Warehousing, Slowly Changing Dimensions (SCD) handle changes in dimension data over time.

  • SCD Type 1: Overwrites old data with new values (no history).
  • SCD Type 2: Maintains historical versions of a record by adding new rows with validity dates.
  • SCD Type 3: Stores limited history (previous value in a new column).

In SCD Type 2, we track changes using fields like:

  • start_date → when the record became active
  • end_date → when the record was closed (NULL = current active)
  • is_active → flag to indicate the current version (a minimal table sketch follows this list)
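
For concreteness, here is a minimal sketch of a dimension table carrying these fields, created from PySpark; the table and column names simply follow the example below.

# Hypothetical target dimension table (t1) with SCD Type 2 tracking fields
spark.sql("""
    CREATE TABLE IF NOT EXISTS t1 (
        emp_id     INT,
        emp_name   STRING,
        dept       STRING,
        start_date DATE,      -- when this version became active
        end_date   DATE,      -- NULL while the version is current
        is_active  BOOLEAN    -- flag for the current version
    ) USING DELTA
""")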

⚡ Example: SCD Type 2 in Spark SQL

Suppose we have two Delta tables:

  • t1 → target dimension table (history + active records)
  • t2 → source table (latest snapshot of employees)

We can write a MERGE statement like this:

MERGE INTO t1 AS tgt
USING t2 AS src
ON tgt.emp_id = src.emp_id AND tgt.is_active = true   -- match only active record
WHEN MATCHED AND (
      tgt.emp_name <> src.emp_name OR tgt.dept <> src.dept  -- detect changes
) THEN UPDATE SET
    tgt.is_active = false,
    tgt.end_date = current_date   -- close old record

WHEN NOT MATCHED THEN
  INSERT (emp_id, emp_name, dept, start_date, end_date, is_active)
  VALUES (src.emp_id, src.emp_name, src.dept, current_date, NULL, true);

✅ Explanation

  1. Matching Active Records: We only match against records where is_active = true.
  2. Detecting Changes: If emp_name or dept has changed, we close the old record by setting:
    • is_active = false
    • end_date = current_date
  3. Inserting New Record: If no active record exists for that employee, we insert a new row with:
    • start_date = current_date
    • end_date = NULL
    • is_active = true

Note: a changed employee still matches its active row, so this single MERGE only closes the old version; the new version is typically inserted in a second pass (for example by staging the source with a NULL merge key, as sketched in the PySpark section above).

📊 Example Data Flow

Before update (t1):

emp_id | emp_name | dept | start_date | end_date | is_active
101    | John     | HR   | 2023-01-01 | NULL     | true

Incoming snapshot (t2):

emp_id | emp_name | dept
101    | John     | IT

After MERGE (t1):

emp_id | emp_name | dept | start_date | end_date   | is_active
101    | John     | HR   | 2023-01-01 | 2025-09-19 | false
101    | John     | IT   | 2025-09-19 | NULL       | true

🚀 Key Takeaways

  • SCD Type 2 allows tracking historical changes in dimension data.
  • Spark SQL with Delta Lake’s MERGE INTO is ideal for handling SCD2 efficiently.
  • Always include validity dates and active flags to simplify queries.

SCD Type 2 with DLT

Here’s a DLT SCD Type 2 pipeline that tracks exact change timestamps using a last_modified or event_time column from the source data instead of just the pipeline run date. This is more accurate for historical auditing.

🟢 DLT SCD Type 2 with Timestamp-Based History

import dlt
from pyspark.sql.functions import col, current_timestamp

# ----------------------------
# 1️⃣ Staging Table (Raw Source)
# ----------------------------
@dlt.table(
    comment="Staging employees with event timestamp"
)
# 🔹 Data quality checks are declared as expectation decorators
@dlt.expect("emp_id_not_null", "emp_id IS NOT NULL")
@dlt.expect("salary_positive", "salary > 0")
def staging_employees():
    # apply_changes expects a streaming source, so read incrementally with Auto Loader
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/raw/employees/")  # Replace with your S3 path
    )

    # Assume the source has an 'event_time' column indicating the change timestamp;
    # if not, fall back to the ingestion timestamp
    if "event_time" not in df.columns:
        df = df.withColumn("event_time", current_timestamp())

    return df

# --------------------------------------
# 2️⃣ Dimension Table (SCD Type 2 target)
# --------------------------------------
# The target of apply_changes is a streaming table; for SCD Type 2,
# DLT adds the __START_AT / __END_AT history columns automatically.
dlt.create_streaming_table(
    name="dim_employees",
    comment="SCD Type 2 Employee Dimension with timestamp-based history",
    table_properties={"delta.enableChangeDataFeed": "true"}
)

# -------------------------------
# 3️⃣ Apply SCD Type 2 Merge
# -------------------------------
dlt.apply_changes(
    target="dim_employees",                                # Target dimension table
    source="staging_employees",                            # Source table
    keys=["emp_id"],                                       # Business key
    sequence_by=col("event_time"),                         # Order changes by event_time
    stored_as_scd_type=2,                                  # Enable SCD Type 2 behavior
    track_history_column_list=["name", "dept", "salary"]   # Changes here open a new version
)

🔹 How This Differs from Pipeline-Date SCD2

  1. Event-based history
    • Uses event_time or last_modified from source.
    • Records show the actual change time, not just pipeline run time.
  2. Better accuracy
    • Multiple changes in a single day are tracked correctly.
    • Historical reports reflect the real change order.
  3. End time & current version
    • DLT stores each version’s validity in the generated __START_AT / __END_AT columns.
    • __END_AT is set automatically when a new version is inserted; the current version is the row where __END_AT is NULL (a point-in-time query sketch follows this list).
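
Once history carries real event timestamps, a point-in-time ("as of") lookup becomes a simple range predicate on the generated columns. A minimal sketch, assuming the pipeline publishes the table under a hypothetical my_schema.dim_employees name:

# What did employee 2 look like on 2025-09-25 (between its two versions)?
spark.sql("""
    SELECT emp_id, name, dept, salary
    FROM my_schema.dim_employees                          -- hypothetical published location
    WHERE emp_id = 2
      AND __START_AT <= TIMESTAMP '2025-09-25 00:00:00'
      AND (__END_AT > TIMESTAMP '2025-09-25 00:00:00' OR __END_AT IS NULL)
""").show()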

🔹 Example Output

emp_id | name | dept    | salary | __START_AT       | __END_AT
1      | John | HR      | 5000   | 2025-09-22 08:00 | NULL
2      | Mary | Finance | 6000   | 2025-09-22 09:15 | 2025-10-01 12:30
2      | Mary | Finance | 7000   | 2025-10-01 12:30 | NULL

(A row is the current version when its __END_AT is NULL.)

Notes

  • sequence_by=col("event_time") ensures changes are applied in the correct chronological order, even when several arrive in one pipeline run.
  • With stored_as_scd_type=2, DLT records each version’s validity window in the generated __START_AT / __END_AT columns, which take their values from event_time rather than the pipeline execution time.
  • This works best if your source system provides event timestamps. Otherwise, current_timestamp() acts as a fallback.

When dlt.apply_changes runs:

  • It detects a change in the tracked columns (salary changed).
  • The old record is closed automatically:
    • __END_AT = timestamp of the new change (2025-10-01 12:30).
  • The new record is inserted:
    • __START_AT = timestamp of the new change
    • __END_AT = NULL, marking it as the current version (see the query sketch below).
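
A quick sketch of reading only the current versions back out (again assuming a hypothetical my_schema.dim_employees location):

current = (
    spark.table("my_schema.dim_employees")   # hypothetical published location
    .where("__END_AT IS NULL")               # open-ended rows are the active versions
)
current.show()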

3. Example of Pivot/Unpivot in PySpark

from pyspark.sql import functions as F

# Pivot: one column per category
pivoted = sales.groupBy("month").pivot("category").agg(F.sum("amount"))

# Unpivot (melt) the pivoted result back to long format with stack()
unpivoted = pivoted.selectExpr("month", "stack(3, 'A', A, 'B', B, 'C', C) as (category, amount)")
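
A quick end-to-end run with made-up sample data (column names match the snippet above) shows both shapes:

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("2025-01", "A", 100), ("2025-01", "B", 50),
     ("2025-02", "A", 70), ("2025-02", "C", 30)],
    ["month", "category", "amount"],
)

pivoted = sales.groupBy("month").pivot("category").agg(F.sum("amount"))
pivoted.show()   # one column per category; NULL where a month had no sales in that category

unpivoted = pivoted.selectExpr(
    "month", "stack(3, 'A', A, 'B', B, 'C', C) as (category, amount)"
)
unpivoted.show() # back to long format: month, category, amount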

🔹 Python Interview Questions

1. How do you mask an email address?

def mask_email(email: str) -> str:
    username, domain = email.split("@")
    if len(username) > 2:
        masked_username = username[0] + "*" * (len(username) - 2) + username[-1]
    else:
        masked_username = username[0] + "*"
    return masked_username + "@" + domain

# Example
print(mask_email("johndoe@example.com"))  # Output: j*****e@example.com

👉 Masks the email by keeping the first and last characters of the username visible, replacing the rest with *.

🔹 AWS Cluster vs Spark Cluster

1. What is an AWS Cluster?

  • AWS cluster generally refers to a group of EC2 instances or services (like EMR, Redshift, ECS, or EKS) working together.
  • Example: Amazon EMR cluster manages big data workloads (a boto3 launch sketch follows this list).
  • Example: Redshift cluster for data warehousing.
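
As a hedged illustration (region, release label, instance types, and IAM role names are placeholders, not a recommended configuration), an EMR cluster with Spark installed can be launched from Python with boto3:

import boto3

emr = boto3.client("emr", region_name="us-east-1")        # placeholder region

response = emr.run_job_flow(
    Name="demo-spark-cluster",                            # placeholder cluster name
    ReleaseLabel="emr-7.1.0",                             # placeholder EMR release
    Applications=[{"Name": "Spark"}],                     # install Spark on the cluster
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",                     # default EMR instance profile
    ServiceRole="EMR_DefaultRole",                         # default EMR service role
)
print(response["JobFlowId"])                               # cluster id, e.g. j-XXXXXXXX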

2. What is a Spark Cluster?

  • Spark cluster = distributed Spark runtime with:
    • Driver → coordinates tasks.
    • Executors → perform computation.
  • Can run standalone, on YARN, Mesos, Kubernetes, or EMR (see the sketch below).
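
A minimal sketch of the idea (the master URLs are placeholders): the application code stays the same and only the cluster manager it points at changes.

from pyspark.sql import SparkSession

# Local / standalone / YARN / Kubernetes differ only in the master URL
spark = (SparkSession.builder
         .appName("cluster-demo")
         .master("spark://standalone-master:7077")    # placeholder standalone master
         # .master("yarn")                             # on a YARN cluster (e.g. EMR)
         # .master("k8s://https://k8s-apiserver:443")  # on Kubernetes
         .getOrCreate())

print(spark.sparkContext.master)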

3. AWS Cluster vs Spark Cluster Differences

Feature     | AWS Cluster (e.g., EMR)              | Spark Cluster
Definition  | Group of AWS compute nodes/services  | Runtime for Spark distributed jobs
Management  | Managed by AWS (EMR, Redshift)       | Managed by Spark master node
Use Case    | Data warehouse, ETL, ML workloads    | Distributed processing of big data
Flexibility | Limited to AWS ecosystem             | Runs anywhere (on-prem/cloud/hybrid)

👉 In short: AWS cluster is infrastructure; Spark cluster is execution runtime.

🔹 Final Tips for Interviews

  1. SQL → Practice CTEs, window functions, aggregations.
  2. PySpark → Focus on DataFrame API, Delta Lake MERGE, window functions.
  3. Python → Be ready with string manipulations, masking, data cleaning.
  4. AWS vs Spark → Be clear on infra (AWS services) vs execution (Spark runtime).

✅ With these examples, you’ll be ready to tackle common SQL, PySpark, Python, and architecture interview questions with confidence!