Delta Live Tables (DLT) in Databricks is a fully managed framework for building reliable ETL pipelines over both streaming and batch data. It simplifies data engineering with automated data quality checks (expectations), event logging, and pipeline monitoring. With DLT you can define tables, views, and expectations from notebooks or SQL scripts, track pipeline execution through event logs, and handle duplicates and streaming data efficiently. Whichever product edition you use (Core, Pro, or Advanced), DLT helps teams maintain production-ready pipelines with lineage, auditing, and governance, making it a good fit for modern data workflows.

1️⃣ Event Logs

Q: How many event logs do we need?
A: Each DLT pipeline automatically generates one event log.

  • 10 pipelines → 10 event logs.
  • Stored under the pipeline's storage location: <storage-location>/system/events
    (with default storage this is dbfs:/pipelines/<pipeline-id>/system/events).
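
A quick way to see which event logs exist is to list the default storage root. This is a minimal sketch and assumes your pipelines use the default dbfs:/pipelines/ storage location (folder names there are pipeline IDs):

# List pipeline storage folders under the default root and print the
# expected event log path for each one
for entry in dbutils.fs.ls("dbfs:/pipelines/"):
    print(entry.path + "system/events")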

Q: Can I create a centralized event log?
A: ✅ Yes. You can union multiple event logs to create a centralized monitoring view:

from functools import reduce

# Event log locations for the pipelines to monitor
# (default storage layout: dbfs:/pipelines/<pipeline-id>/system/events)
paths = [
    "dbfs:/pipelines/pipeline1/system/events",
    "dbfs:/pipelines/pipeline2/system/events"
]

# Read each event log (a Delta table) and union them into one DataFrame
event_logs = [spark.read.format("delta").load(p) for p in paths]
combined = reduce(lambda a, b: a.unionByName(b), event_logs)
combined.createOrReplaceTempView("all_event_logs")
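
Once the temp view exists, you can query it with Spark SQL. A minimal usage sketch that counts events by severity (level is part of the event log schema described below):

# Count events per severity level across all registered pipelines
spark.sql("""
    SELECT level, COUNT(*) AS event_count
    FROM all_event_logs
    GROUP BY level
    ORDER BY event_count DESC
""").show()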

Q: How to query the events folder?
A:

# The event log lives at <storage-location>/system/events;
# with default storage the folder name is the pipeline ID
pipeline_id = "<pipeline-id>"
events_path = f"dbfs:/pipelines/{pipeline_id}/system/events"
event_log_df = spark.read.format("delta").load(events_path)
display(event_log_df)
  • Filter on level to isolate ERROR/WARN events, or drill into the expectation results to find failed rules (see the sketch below).
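
A minimal sketch of that filter, reusing the event_log_df loaded above:

from pyspark.sql.functions import col

# Keep only warning/error events, most recent first
problems_df = (
    event_log_df
    .filter(col("level").isin("ERROR", "WARN"))
    .orderBy(col("timestamp").desc())
)
display(problems_df.select("timestamp", "level", "message"))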

Q: What schema does the event log have?

  Column         Description
  timestamp      Event time
  level          INFO / WARN / ERROR
  pipeline_id    Pipeline identifier
  pipeline_name  Name of the pipeline
  batch_id       Batch number
  origin         Source table/stream info
  target         Target table info
  message        Event description
  expectations   Array of data quality (expectation) results
  metrics        Map of processed rows, failures, latency

  Note: in the physical event log several of these fields are nested rather than top-level
  columns, e.g. pipeline_id and pipeline_name sit inside the origin struct, and expectation
  results and metrics sit inside the details column.

Q: What is event_log() in SQL?

  • event_log() is a built-in table-valued function in Databricks SQL/DLT for querying the
    event log of a pipeline, or of a table managed by a pipeline:
SELECT * FROM event_log(TABLE(`catalog`.`schema`.`table`))
  • It also accepts a pipeline ID string, e.g. event_log('<pipeline-id>').
  • The result exposes the same kind of information as the schema above.

2️⃣ DLT Pipelines and Assets

Q: What types of assets can a DLT pipeline take?

  Asset Type              Description                                          Example
  Notebook                Python/SQL notebook defining tables & expectations   /Repos/dlt_pipelines/orders_pipeline
  Python file/module      .py file with DLT table definitions                  /Workspace/dlt/code/orders_pipeline.py
  SQL script              SQL notebook with CREATE LIVE TABLE statements       /Repos/dlt_sql/orders_pipeline.sql
  Delta table references  Existing Delta tables as sources                     dlt.read("bronze_orders")
  Config files            JSON or YAML for environment parameters              /Workspace/dlt/config/prod_config.json
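
These assets are attached to a pipeline through its settings. A minimal sketch of the relevant part of a pipeline settings payload, built here as a Python dict (the paths are hypothetical, and the exact library entry types available can vary by workspace version):

import json

# Sketch of pipeline settings attaching notebook/file assets and config parameters
pipeline_settings = {
    "name": "orders_pipeline",
    "libraries": [
        {"notebook": {"path": "/Repos/dlt_pipelines/orders_pipeline"}},
        {"file": {"path": "/Workspace/dlt/code/orders_pipeline.py"}},
    ],
    "configuration": {"env": "prod"},  # exposed to pipeline code via spark.conf
}
print(json.dumps(pipeline_settings, indent=2))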

Q: What does "DLT object" mean?

  • Any entity managed and tracked by DLT (see the sketch below for a table and a view defined together):
    • Tables (@dlt.table)
    • Views (@dlt.view)
    • Pipelines
    • Expectations
    • Event logs
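
A small sketch of two of these object types defined together; the source table name is hypothetical:

import dlt

@dlt.view(comment="Raw orders exposed as a DLT view (not persisted)")
def bronze_orders_vw():
    return spark.read.table("raw_db.orders")   # hypothetical source table

@dlt.table(comment="Persisted DLT table built from the view")
def bronze_orders():
    return dlt.read("bronze_orders_vw")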

Q: DLT product editions – Core vs Pro vs Advanced?

  Edition   Features
  Core      Batch & streaming append pipelines, tables/views, event logs, schema evolution
  Pro       Core + change data capture (APPLY CHANGES INTO) and slowly changing dimensions (SCD)
  Advanced  Pro + data quality expectations (expect, expect_or_drop, expect_or_fail)
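
The edition is chosen per pipeline via the edition field in the pipeline settings. A tiny sketch (the pipeline name is hypothetical):

# Edition is set in the pipeline settings: CORE, PRO, or ADVANCED
pipeline_settings = {
    "name": "silver_orders_pipeline",
    "edition": "ADVANCED",   # needed if the pipeline uses expectations
}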

3️⃣ DLT Data Handling

Q: How does DLT handle duplicates?

  • By default, duplicates are not removed in append mode.
  • Strategies to handle duplicates:
    1. Drop duplicates:
       dlt.read("bronze_orders").dropDuplicates(["order_id"])
    2. MERGE/upsert (or APPLY CHANGES INTO) for idempotent pipelines
    3. Expectations to flag bad records; note that expectations are evaluated per row, so an
       aggregate check like "no duplicate order_id" is better done in a separate monitoring
       query or view over the target table
    4. Streaming watermarks to bound state while deduplicating late-arriving data
       (see the sketch below)
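
A minimal sketch of watermark-based deduplication in a streaming DLT table; the event_time column name is an assumption:

import dlt

@dlt.table(comment="Orders deduplicated within a 10-minute watermark")
def silver_orders_dedup():
    return (
        dlt.read_stream("bronze_orders")
        # Bound the dedup state: only consider events up to 10 minutes late
        .withWatermark("event_time", "10 minutes")
        .dropDuplicates(["order_id", "event_time"])
    )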

Q: Can DLT handle duplicates automatically?

  • ❌ Not automatically — you must define deduplication logic (dropDuplicates or merge)
  • ✅ DLT tracks duplicates via expectations in event logs

Q: DLT append flow?

  1. Source data → dlt.read() / dlt.read_stream()
  2. Transformations & expectations
  3. Append to target table via @dlt.table
  4. Event logs capture batch info and DQ metrics
  5. Downstream consumers read appended data
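
For step 1, the source is often a streaming ingest table. A minimal sketch using Auto Loader; the landing path and file format are assumptions:

import dlt

@dlt.table(comment="Bronze orders ingested incrementally with Auto Loader")
def bronze_orders():
    return (
        spark.readStream
        .format("cloudFiles")                 # Auto Loader
        .option("cloudFiles.format", "json")  # assumed source format
        .load("/Volumes/landing/orders/")     # hypothetical landing path
    )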

4️⃣ Quick Reference Examples

Python DLT table (Append + Expectation):

import dlt

@dlt.table(comment="Deduplicated orders with a basic data quality check")
@dlt.expect("positive_amount", "amount > 0")   # record (but do not drop) rows with non-positive amounts
def silver_orders():
    # Stream from the bronze table and drop duplicate order IDs;
    # in production, pair this with a watermark to bound streaming state
    return dlt.read_stream("bronze_orders").dropDuplicates(["order_id"])
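
@dlt.expect only records violations; to actually filter bad rows you can use @dlt.expect_or_drop (or @dlt.expect_or_fail to abort the update). A small variant sketch:

import dlt

@dlt.table
@dlt.expect_or_drop("positive_amount", "amount > 0")   # drop violating rows instead of only logging them
def silver_orders_strict():
    return dlt.read_stream("bronze_orders").dropDuplicates(["order_id"])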

SQL DLT table example:

CREATE LIVE TABLE silver_orders
COMMENT "Deduplicated orders"
AS
SELECT DISTINCT *
FROM LIVE.bronze_orders
WHERE amount > 0;

Query event log SQL example:

SELECT *
FROM event_log(TABLE(`prod_catalog`.`sales`.`silver_orders`))
WHERE level IN ('ERROR','WARN')
ORDER BY timestamp DESC;

Centralized event log view (Python):

from functools import reduce

paths = [
    "dbfs:/pipelines/pipeline1/system/events",
    "dbfs:/pipelines/pipeline2/system/events"
]
# Union the per-pipeline event logs into one monitoring view
event_logs = [spark.read.format("delta").load(p) for p in paths]
combined = reduce(lambda a, b: a.unionByName(b), event_logs)
combined.createOrReplaceTempView("all_event_logs")

This compilation covers:

  • Event logs (event_log(), schema, querying)
  • Pipeline assets (notebooks, SQL, Python)
  • DLT objects (tables, views, expectations)
  • DLT tiers (Core vs Pro)
  • Duplicate handling and append flow