Here are common client interview questions for Databricks and PySpark data engineering roles, along with clear, concise answers.

Top Client Interview Questions

1. How would you design a real-time data pipeline in Databricks to process streaming data from Kinesis Firehose?

Design:

  • Data Ingestion: Stream data into Kinesis Firehose, which delivers it to S3 (raw zone).
  • Databricks Autoloader (structured streaming) continuously reads new files from S3.
  • Transformations: Apply Delta Live Tables / Structured Streaming jobs for cleansing, enrichment, and aggregations.
  • Storage: Persist results in Delta Lake (bronze → silver → gold layers).
  • Consumption: Expose curated data to BI tools (Power BI, Tableau) or ML models in near real-time.
  • Reliability: Use checkpointing, schema evolution, and ACID Delta tables.

🔥 Elevator pitch: “I’ll stream via Kinesis Firehose → S3, use Databricks Autoloader with Structured Streaming + Delta Lake for real-time processing, and serve curated data to BI/ML with Delta Live Tables ensuring reliability and scalability.”
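A minimal sketch of the Autoloader ingestion step described above, assuming illustrative S3 paths, checkpoint locations, and table names:

raw_stream = (spark.readStream
    .format("cloudFiles")                                  # Databricks Autoloader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/clickstream/schema")
    .load("s3://my-bucket/firehose/raw/"))                 # illustrative bucket

(raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/clickstream/bronze")
    .outputMode("append")
    .toTable("bronze.clickstream"))                        # bronze-layer Delta table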

2. Describe a scenario where you had to integrate data from multiple AWS services into Databricks. What challenges did you face, and how did you resolve them?

I worked on a project where data was coming from multiple AWS sources — transactional data from RDS (Postgres), clickstream logs via Kinesis Firehose into S3, and reference/master data from DynamoDB. All of this had to be integrated and processed in Databricks for downstream analytics.

Challenges & Resolutions:

  1. Schema Evolution – Different sources had evolving schemas.
    • Resolution: Used Databricks Autoloader with schema evolution and enforced Delta Lake bronze-silver-gold layers to handle changes gracefully.
  2. Data Latency – Kinesis + Firehose delivered near real-time, while RDS/DynamoDB were batch-ingested.
    • Resolution: Designed a Lambda + Glue job to micro-batch RDS data into S3, aligning ingestion frequency with streaming data.
  3. Security & Access Management – Multiple services, IAM roles, and cross-account access.
    • Resolution: Implemented IAM role-based access with instance profiles in Databricks, and stored secrets in AWS Secrets Manager + Databricks Secret Scope.
  4. Cost Optimization – Large raw data in S3 was costly to process repeatedly.
    • Resolution: Partitioned Delta tables by date/user, and used Z-Ordering for query efficiency.

Outcome: A scalable, near real-time data pipeline in Databricks where business teams could query curated Delta tables in minutes instead of hours.
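As an illustration of the secrets-based access pattern from challenge 3, here is a hedged sketch; the scope, key, host, and table names are hypothetical:

# Credentials come from a Databricks secret scope backed by AWS Secrets Manager
jdbc_password = dbutils.secrets.get(scope="aws-rds", key="postgres-password")

orders_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-host:5432/sales")   # hypothetical endpoint
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", jdbc_password)
    .load())

# Land the batch extract alongside the streaming data in the bronze zone
(orders_df.write
    .format("delta")
    .mode("append")
    .save("/mnt/delta/bronze/orders"))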

3. How do you approach testing and debugging PySpark code in Databricks? What tools or techniques do you use?

Approach to Testing & Debugging PySpark in Databricks:

  1. Unit Testing
    • Write modular PySpark code and test transformations with pytest + chispa (PySpark DataFrame assertions).
    • Mock small input DataFrames in notebooks for quick validation.
  2. Notebook Debugging
    • Use display() / df.show() on sampled data.
    • Apply df.explain() to check query plans and optimize joins/shuffles.
    • Use %sql magic commands to validate transformations in SQL.
  3. Logging & Monitoring
    • Use Spark logs + Databricks cluster driver logs for errors.
    • Add structured logging with log4j in jobs.
    • Monitor via Ganglia / Spark UI for stages, shuffles, skew.
  4. Integration Testing
    • Test with bronze → silver → gold layers in Delta Lake.
    • Use checkpointing in streaming to ensure restart consistency.
  5. Performance Debugging
    • Check skewed partitions, caching, broadcast joins.
    • Optimize with partitioning, bucketing, Z-ordering.

🔥 Elevator Pitch: “I test PySpark with pytest + chispa, debug using Spark UI, driver logs, and df.explain(), and ensure reliability with checkpointing and Delta Lake testing across layers. For performance issues, I monitor skew, optimize joins, and tune partitions.”
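A minimal sketch of the pytest + chispa style of unit test mentioned above; add_full_name is a hypothetical transformation under test:

from chispa.dataframe_comparer import assert_df_equality
from pyspark.sql import SparkSession, functions as F

def add_full_name(df):
    # Hypothetical transformation under test
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

def test_add_full_name():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    source = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    expected = spark.createDataFrame(
        [("Ada", "Lovelace", "Ada Lovelace")],
        ["first_name", "last_name", "full_name"])
    assert_df_equality(add_full_name(source), expected)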

4. Can you explain how to implement Slowly Changing Dimension (SCD) Type 2 in Databricks using PySpark?

Approach-1

Goal: Maintain full history of changes (e.g., employee address changes) by marking old records as inactive and inserting new ones.

Steps:

  1. Read Source & Target
    • Source: incoming data (new/updated records).
    • Target: existing Delta table holding SCD2 history.
  2. Identify Changes
    • Join source with target on business key (e.g., employee_id).
    • Compare attributes (e.g., address, department).
  3. Flag Changes
    • If no match → Insert as new record.
    • If match but attribute changed →
      • Update existing record: set end_date = current_date, is_active = false.
      • Insert new record: with start_date = current_date, end_date = null, is_active = true.
  4. Write with Delta Merge
from delta.tables import DeltaTable
from pyspark.sql.functions import col

delta_tbl = DeltaTable.forPath(spark, "/mnt/delta/employee")
target_df = delta_tbl.toDF()

# Changed rows need BOTH an update (close the old version) and an insert
# (open the new version). Stage the source so each changed row appears twice:
# once with its real key (matches -> closes the old record) and once with a
# NULL merge key (never matches -> falls into the insert branch).
changed_df = (source_df.alias("s")
    .join(target_df.alias("t"), col("s.emp_id") == col("t.emp_id"))
    .where("t.is_active = true AND (t.address <> s.address OR t.dept <> s.dept)")
    .select("s.*"))

staged_df = (source_df.selectExpr("emp_id AS merge_key", "*")
    .unionByName(changed_df.selectExpr("NULL AS merge_key", "*")))

delta_tbl.alias("t") \
  .merge(staged_df.alias("s"), "t.emp_id = s.merge_key") \
  .whenMatchedUpdate(
      condition="t.is_active = true AND (t.address <> s.address OR t.dept <> s.dept)",
      set={
          "end_date": "current_date()",
          "is_active": "false"
      }
  ) \
  .whenNotMatchedInsert(
      values={
          "emp_id": "s.emp_id",
          "name": "s.name",
          "address": "s.address",
          "dept": "s.dept",
          "start_date": "current_date()",
          "end_date": "null",
          "is_active": "true"
      }
  ) \
  .execute()
  5. Benefits of Delta Lake
    • ACID transactions ensure consistency.
    • Merge API simplifies SCD2 logic.
    • Time travel allows historical audits.

🔥 Elevator pitch:
“In Databricks, I implement SCD Type 2 using Delta Lake’s merge. I join source vs. target, update old records with an end_date and inactive flag, and insert the changed rows as new active records, so full history is preserved.”

Approach-2

SCD Type 2 with DLT

Scenario: Customer profile updates (address/phone/email changes).

Steps:

  1. Define Source & Target
    • Source: incoming change feed (from Kafka/Kinesis/S3).
    • Target: DLT-managed Delta table holding historical data.
  2. Use APPLY CHANGES INTO
-- Declare the SCD2 target table first, then feed it with APPLY CHANGES INTO
CREATE OR REFRESH STREAMING LIVE TABLE customers_scd2
TBLPROPERTIES ("quality" = "silver");

APPLY CHANGES INTO LIVE.customers_scd2
FROM STREAM(LIVE.customers_incoming)
KEYS (customer_id)
APPLY AS DELETE WHEN row_status = 'D'
SEQUENCE BY update_timestamp
COLUMNS *
STORED AS SCD TYPE 2;
  3. Explanation:
    • KEYS (customer_id) → business key.
    • SEQUENCE BY update_timestamp → orders changes so the latest version wins.
    • STORED AS SCD TYPE 2 → DLT automatically manages the validity columns (__START_AT, __END_AT); the current version of each key is the row where __END_AT is null.
  4. DLT Handles:
    • Expiring old records (setting __END_AT).
    • Inserting new versions.
    • Maintaining history automatically.

🔥 Elevator Pitch:
“With DLT, I use APPLY CHANGES INTO with STORED AS SCD TYPE 2. It auto-handles history, sets valid-from/valid-to columns, and tracks the current version of each key, so I don’t have to manually code Delta merges.”

Approach-3

SCD Type 2 in Databricks Delta Live Tables (DLT) with PySpark

DLT supports SCD Type 2 natively through the apply_changes API (in Python) — no need to hand-code Delta merge logic.

Example 👇

import dlt
from pyspark.sql.functions import col

# Source stream (e.g., incoming changes from S3/Kafka/Kinesis)
@dlt.table(
  comment="Incoming customer updates",
  temporary=True
)
def customers_incoming():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load("/mnt/raw/customers")

# Target SCD Type 2 table (declare it before wiring up apply_changes)
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target = "customers_scd2",
    source = "customers_incoming",
    keys = ["customer_id"],
    sequence_by = col("update_timestamp"),
    stored_as_scd_type = "2"
)

🔑 How it works

  • keys = ["customer_id"] → defines the business key.
  • sequence_by → ensures updates are applied in correct order.
  • stored_as_scd_type = "2" → DLT automatically manages:
    • Closing old versions (setting __END_AT)
    • Inserting new versions (with a new __START_AT)
    • Preserving full history

🔥 Elevator Pitch:
“In PySpark with DLT, I use dlt.apply_changes with stored_as_scd_type = "2". DLT auto-handles inserting new versions, expiring old ones, and maintaining history columns — making SCD Type 2 implementation much cleaner than hand-written Delta merge code.”

5. Share your experience with Delta Lake. Have you used it to implement a medallion architecture in Databricks?

Experience with Delta Lake & Medallion Architecture

Yes, I’ve worked extensively with Delta Lake on Databricks and implemented the Medallion Architecture (Bronze → Silver → Gold) in real projects.

  • Bronze Layer (Raw Ingestion)
    • Landed raw data from AWS S3 / Kinesis / RDS directly into Delta tables.
    • Used Databricks Autoloader for continuous ingestion with schema evolution.
  • Silver Layer (Cleansed & Conformed)
    • Applied transformations, deduplication, and joins to standardize data.
    • Implemented Slowly Changing Dimensions (SCD Type 2) using Delta Merge / dlt.apply_changes for historical tracking.
    • Enforced data quality checks with Delta Live Tables expectations.
  • Gold Layer (Business-Ready)
    • Built aggregated fact tables and business metrics (e.g., customer 360, sales trends).
    • Optimized Delta tables with Z-Ordering and partitioning for BI/ML workloads.
    • Served data to Power BI / Tableau / ML models.

Delta Lake Benefits I leveraged:

  • ACID transactions → reliable pipeline with no data corruption.
  • Time travel → easy rollback & debugging.
  • Schema evolution → handled evolving source systems.
  • Performance tuning → caching, auto-optimize, and optimize writes.

Outcome: A scalable, governed data lakehouse where analysts and data scientists could consume business-ready data within minutes instead of hours.

🔥 Elevator Pitch Version:
“Yes, I’ve used Delta Lake to implement Medallion Architecture. Bronze for raw ingestion, Silver for cleansing and conformance (including SCD2), and Gold for BI/ML consumption. Delta’s ACID, time travel, and schema evolution made the pipeline reliable, scalable, and easy to maintain.”
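A hedged sketch of the Delta Live Tables expectations mentioned for the Silver layer; the table, column, and rule names are illustrative:

import dlt

@dlt.table(comment="Cleansed customer records")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")   # drop bad rows
@dlt.expect("valid_email", "email LIKE '%@%'")                        # warn-only check
def customers_silver():
    return (dlt.read_stream("customers_bronze")
            .select("customer_id", "email", "address", "update_timestamp"))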

6. Explain the difference between Spark DataFrame and Spark SQL, and when to use one over the other in Databricks?

Difference: Spark DataFrame vs Spark SQL

  • Spark DataFrame (API)
    • Programmatic interface in Python/Scala/Java.
    • Operations expressed as transformations (withColumn, filter, groupBy).
    • Gives more flexibility for complex business logic, UDFs, dynamic pipelines.
  • Spark SQL
    • Declarative interface using SQL queries (spark.sql("SELECT ...")).
    • Familiar to analysts; easier for ad-hoc queries, joins, aggregations.
    • Integrates well with notebooks (%sql) and BI tools.

👉 Both ultimately compile down to the same Catalyst Optimizer and execution engine, so performance is comparable.

When to Use

  • DataFrame API:
    • When building production ETL pipelines.
    • Need programmatic control, conditional logic, or PySpark functions.
    • Example: implementing SCD Type 2 or ML feature engineering.
  • Spark SQL:
    • When the team is SQL-centric (analysts, BI developers).
    • For ad-hoc exploration or writing clean, simple queries.
    • In Delta Live Tables, SQL-defined pipelines are common for declarative ETL.

🔥 Elevator Pitch:
“Spark DataFrame API is best for programmatic, complex ETL logic, while Spark SQL is ideal for declarative queries and exploration. Both run on the same engine, so I usually mix them in Databricks depending on the use case — SQL for readability, DataFrames for flexibility.”
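A small illustration of the same aggregation expressed both ways (table and column names are illustrative):

from pyspark.sql import functions as F

# DataFrame API: programmatic, easy to compose and unit test
df_result = (spark.table("sales")
    .filter(F.col("region") == "EU")
    .groupBy("product_id")
    .agg(F.sum("amount").alias("total_amount")))

# Spark SQL: declarative, analyst-friendly; same Catalyst plan underneath
sql_result = spark.sql("""
    SELECT product_id, SUM(amount) AS total_amount
    FROM sales
    WHERE region = 'EU'
    GROUP BY product_id
""")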

7. How would you optimize a slow-running Spark job? Walk through the troubleshooting process and potential solutions.

Troubleshooting & Optimizing a Slow Spark Job

1. Understand the Problem

  • Check Spark UI (stages, tasks, DAG).
  • Identify bottlenecks → skewed partitions, shuffles, wide transformations.
  • Look at driver/executor logs for GC or OOM issues.

2. Common Bottlenecks & Fixes

  • Data Skew (few partitions with huge data)
    🔹 Solution: Use salting, repartitioning, or broadcast joins for small tables.
  • Too Many/Too Few Partitions
    🔹 Solution: Tune spark.sql.shuffle.partitions (default 200) to match cluster size and data volume.
  • Large Joins & Shuffles
    🔹 Solution: Use broadcast joins for small tables (broadcast(df)), cache reused tables, or pre-partition data.
  • Unnecessary Wide Transformations
    🔹 Solution: Minimize groupBy / distinct / orderBy. Use window functions or incremental aggregation where possible.
  • Serialization & UDF Overhead
    🔹 Solution: Replace Python UDFs with Spark SQL functions or Pandas UDFs.
  • Storage Format & Reads
    🔹 Solution: Store in Delta/Parquet, partition by query columns, and use Z-Ordering in Delta Lake.
  • Cluster Issues
    🔹 Solution: Scale cluster appropriately (auto-scaling, right instance type). Monitor executor memory/CPU in Ganglia/CloudWatch.

3. Process I Follow

  • Step 1: Reproduce issue on sample data.
  • Step 2: Analyze Spark UI for skew/shuffles.
  • Step 3: Tune partitions, joins, caching.
  • Step 4: Optimize storage layout.
  • Step 5: Scale compute only if inefficiencies are fixed.

🔥 Elevator Pitch:
“I debug slow Spark jobs by first analyzing Spark UI for skew and shuffles. I fix skew with salting/broadcast joins, tune partitions, replace UDFs with native functions, and optimize Delta tables with partitioning/Z-order. Only after fixing inefficiencies, I consider scaling the cluster.”
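A hedged sketch of a few of these fixes (broadcast join, shuffle-partition tuning, Z-ordering); the table names and partition count are illustrative:

from pyspark.sql.functions import broadcast

# Tune shuffle parallelism to the data volume and cluster size
spark.conf.set("spark.sql.shuffle.partitions", "400")

fact = spark.table("gold.sales_fact")      # large, partitioned fact table
dim = spark.table("gold.product_dim")      # small dimension table

# Broadcast the small side so the large fact table is not shuffled
joined = fact.join(broadcast(dim), "product_id")

# Co-locate frequently filtered columns for faster reads (Delta Lake)
spark.sql("OPTIMIZE gold.sales_fact ZORDER BY (customer_id)")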

8. Can you explain the complex data pipeline you’ve built using PySpark in Databricks and how you overcame challenges?

Complex Data Pipeline Example

Scenario:
I built a real-time + batch data pipeline in Databricks with PySpark for a retail client.

  • Sources: Clickstream data via Kinesis Firehose → S3, transactional sales data from RDS, product/master data from DynamoDB.
  • Pipeline Design: Implemented Medallion Architecture (Bronze → Silver → Gold) using Delta Lake.

Steps:

  1. Bronze Layer
    • Used Databricks Autoloader to continuously ingest Firehose-delivered data from S3.
    • Batch-ingested RDS data with JDBC + incremental loads.
    • Landed all raw data in Delta format.
  2. Silver Layer
    • Cleansed, deduplicated, and joined multiple sources.
    • Implemented SCD Type 2 for product/customer data using Delta Merge.
    • Enforced data quality rules with DLT expectations (e.g., null checks, type checks).
  3. Gold Layer
    • Created aggregated fact tables (sales per store, customer 360).
    • Optimized tables with Z-ordering and partitioning.
    • Served curated data to BI dashboards (Power BI) and ML models.

Challenges & Solutions

  1. Schema Evolution (clickstream JSON had new fields frequently)
    • ✅ Used Autoloader’s schema evolution and evolved Silver tables dynamically.
  2. Data Skew in Joins (few customers had massive clickstream data)
    • ✅ Applied salting + broadcast joins to balance partitions.
  3. Late Arriving & Out-of-Order Data (from Kinesis)
    • ✅ Used watermarking in Structured Streaming to handle late events gracefully.
  4. Security & Access Management
    • ✅ Integrated with AWS Secrets Manager + Databricks secret scopes for credentials, IAM role-based access to S3.
  5. Performance Optimization
    • ✅ Tuned spark.sql.shuffle.partitions, cached reused data, and used Delta cache + Z-order for fast queries.

Outcome

  • Reduced ingestion-to-insight latency from 6 hours → near real-time (<5 min).
  • Delivered a scalable lakehouse pipeline with reliable historical tracking and business-ready data.

🔥 Elevator Pitch:
“I built a real-time + batch pipeline in Databricks with PySpark, ingesting from Kinesis, RDS, and DynamoDB into a Delta Lake medallion architecture. Biggest challenges were schema evolution, skew, and late data. I solved them using Autoloader, salting/broadcast joins, and watermarking. The pipeline scaled well and cut latency from hours to minutes.”
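A hedged sketch of the salting technique used for the skewed join in challenge 2; clickstream_df, customers_df, and the bucket count are illustrative:

from pyspark.sql import functions as F

SALT_BUCKETS = 16  # size to the observed skew

# Skewed side: scatter each row into a random salt bucket
clicks_salted = clickstream_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Smaller side: replicate every row across all salt buckets
customers_salted = customers_df.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(SALT_BUCKETS)])))

# Join on the composite key so hot customer_ids spread across partitions
joined = clicks_salted.join(customers_salted, ["customer_id", "salt"])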

What is Watermarking?

  • In streaming pipelines, events don’t always arrive in order (network delays, retries).
  • Watermark tells Spark how long to wait for late data based on an event-time column.
  • Once watermark threshold passes, Spark drops late data to avoid unbounded state growth.

How It Works

Example:

from pyspark.sql.functions import window, col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

raw = spark.readStream \
    .format("kinesis") \
    .option("streamName", "clickstream") \
    .load()

# The Kinesis source delivers the payload in a binary `data` column;
# parse it into typed columns before applying the watermark.
events = raw.select(from_json(col("data").cast("string"), event_schema).alias("e")) \
    .select("e.*")

# Apply watermarking on the event_time column
aggregated = events \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window("event_time", "5 minutes"),
        "user_id"
    ).count()

👉 Here:

  • withWatermark("event_time", "10 minutes") means:
    • Spark tracks the maximum event time seen so far and keeps the watermark 10 minutes behind it.
    • An event with event_time = 1:00 PM may be dropped once the watermark has advanced past 1:00 PM (i.e., Spark has already seen events with event_time ≥ 1:10 PM).
  • This keeps state small and prevents memory blowups.

When I Use It in Databricks

  • In real-time pipelines (Kinesis → Databricks) where clickstream/events may arrive late.
  • Ensures aggregates (like number of clicks per user per 5 minutes) stay accurate but bounded.
  • Critical in gold layer metrics where stale data must not cause duplicates.

🔥 Elevator Pitch:
“Watermarking in PySpark streaming tells Spark how long to wait for late events. For example, with a 10-minute watermark, Spark will drop data arriving more than 10 minutes late, keeping state small and queries performant. I use it in Databricks pipelines for clickstream and IoT data.”

9. Can you explain the key differences between RDD, DataFrame, and Dataset, and when you would use one over the other?

RDD vs DataFrame vs Dataset

1. RDD (Resilient Distributed Dataset)

  • Low-level API → distributed collection of objects.
  • Type safety (in Scala/Java), but no schema (unstructured).
  • No Catalyst optimization → less efficient.
  • Best for: fine-grained transformations, unstructured/semi-structured data, or legacy Spark code.

2. DataFrame

  • High-level API → distributed table with rows & named columns.
  • Backed by Catalyst optimizer + Tungsten execution → very efficient.
  • Similar to SQL table.
  • Available in PySpark, Scala, Java, R.
  • Best for: ETL pipelines, analytics, SQL-like queries, ML preprocessing.

3. Dataset (Scala/Java only)

  • Combination of RDD + DataFrame.
  • Provides type safety + object-oriented programming (like RDD) + optimization (like DataFrame).
  • Not available in PySpark.
  • Best for: strongly typed use cases where compile-time checks are important (e.g., financial transactions).

When to Use Which?

  • RDD → When you need low-level control, custom transformations, or work with unstructured data.
  • DataFrame → Default choice for ETL, aggregations, SQL queries, ML feature engineering.
  • Dataset → In Scala/Java projects where type safety and compile-time checks are critical.

🔥 Elevator Pitch:
“RDD is low-level and unoptimized, DataFrame is high-level and optimized with Catalyst, and Dataset combines both but is only in Scala/Java. In Databricks, I mostly use DataFrames for performance and simplicity, and only fall back to RDDs for unstructured data or custom logic.”
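A small illustration of the usual split (the path and column names are illustrative):

# DataFrame: the default choice, optimized by Catalyst/Tungsten
events = spark.read.json("/mnt/raw/events")
active_by_country = (events
    .filter(events.status == "active")
    .groupBy("country")
    .count())

# Dropping to the RDD API only for low-level, custom transformations
country_counts = (events.rdd
    .map(lambda row: (row["country"], 1))
    .reduceByKey(lambda a, b: a + b))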

10. Can you explain the concept of dynamic partition pruning in Spark SQL, how it works, and in what scenarios it improves query performance?

Dynamic Partition Pruning (DPP) in Spark SQL

Concept

  • Partition pruning = reducing the number of partitions scanned by filtering partitions early.
  • Static Partition Pruning: filter is known at compile time → Spark reads only matching partitions.
  • Dynamic Partition Pruning (DPP): filter value is not known until runtime (comes from another query/subquery). Spark waits to evaluate it, then prunes partitions dynamically.

How It Works

Example:

-- orders is a large fact table partitioned by order_date
SELECT *
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
 AND o.order_date = c.signup_date
WHERE c.country = 'US';   -- selective filter on the dimension side (illustrative)

👉 Here, o.order_date depends on c.signup_date (not known until query runs).

  • Without DPP → Spark scans all partitions of orders.
  • With DPP → Spark waits for customer filter result, prunes irrelevant partitions, and scans only required ones.

Scenarios Where DPP Helps

  1. Fact table partitioned on date/key joined with a dimension table filter.
  2. Ad-hoc filters (e.g., user-selected filter in BI dashboard).
  3. Large partitioned Delta/Parquet tables where scanning all partitions would be very costly.

Performance Benefit

  • Reduces unnecessary partition scans → less I/O + faster query runtime.
  • Especially useful in lakehouse architectures (Delta Lake partitioned tables).

🔥 Elevator Pitch:
“Dynamic Partition Pruning allows Spark SQL to prune partitions at runtime when filter values come from another query or subquery. For example, filtering a large orders table by a customer’s signup date. Instead of scanning all partitions, Spark only reads the relevant ones. This reduces I/O and significantly improves performance in partitioned tables.”
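A hedged way to confirm DPP is kicking in on the illustrative query above: check the physical plan for a dynamic pruning expression (the config key below is the standard Spark 3.x flag, enabled by default):

# DPP is on by default in Spark 3.x; this flag controls it
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

spark.sql("""
    SELECT *
    FROM orders o
    JOIN customers c
      ON o.customer_id = c.customer_id
     AND o.order_date = c.signup_date
    WHERE c.country = 'US'
""").explain(True)
# Look for a dynamic pruning expression / partition filters on orders
# in the physical plan to confirm runtime pruning.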