Modern data engineering requires handling both real-time streaming data and traditional batch workloads efficiently. Databricks has evolved Delta Live Tables (DLT) into Lakeflow Declarative Pipelines, enabling data engineers to build reliable and scalable pipelines with less code.

Using the dp module, developers can define streaming tables and materialized views declaratively. Databricks automatically manages dependencies, execution order, monitoring, and recovery.

In this guide, you’ll learn how to use @dp.table for streaming pipelines and @dp.materialized_view for batch processing with practical examples.

What is dp in Databricks?

The dp module stands for Declarative Pipelines and is imported as:

from pyspark import pipelines as dp

Instead of manually orchestrating ETL jobs, you define datasets and transformations. Databricks handles the pipeline execution behind the scenes.

Benefits of Using dp

  • Simplified pipeline development
  • Automatic dependency tracking
  • Built-in monitoring and observability
  • Support for both streaming and batch processing
  • Easy implementation of Medallion Architecture
  • Reduced operational complexity

Understanding Streaming Tables

Streaming tables continuously process incoming data as it arrives.

Common streaming sources include:

  • Amazon S3
  • Azure Data Lake Storage
  • Google Cloud Storage
  • Apache Kafka
  • Change Data Capture (CDC) streams

Example: Creating a Bronze Streaming Table

Suppose CSV files are continuously arriving in cloud storage.

from pyspark import pipelines as dp
@dp.table(
name="bronze_orders"
)
def bronze_orders():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/Volumes/raw/orders/")
)

How It Works

  • spark.readStream enables streaming mode.
  • Auto Loader detects new files automatically.
  • Data is processed incrementally.
  • A streaming table named bronze_orders is created.

Building a Silver Streaming Layer

After ingesting raw data, the next step is cleansing and validation.

Example: Bronze to Silver Transformation

from pyspark import pipelines as dp
@dp.table(
name="silver_orders"
)
def silver_orders():
return (
spark.readStream.table("bronze_orders")
.filter("order_amount > 0")
)

Purpose of the Silver Layer

  • Remove invalid records
  • Standardize data formats
  • Apply business rules
  • Improve data quality

Understanding Materialized Views

Materialized Views are optimized for batch processing and reporting workloads.

Unlike streaming tables, materialized views store precomputed results that can be refreshed automatically when source data changes.

Example: Creating a Gold Layer Materialized View

from pyspark import pipelines as dp
@dp.materialized_view(
name="daily_sales"
)
def daily_sales():
return (
spark.table("silver_orders")
.groupBy("order_date")
.sum("order_amount")
)

What Happens?

  • Reads clean data from the Silver layer
  • Aggregates sales by date
  • Stores precomputed results
  • Improves dashboard and reporting performance

Complete Medallion Architecture Example

The following example demonstrates a complete Bronze-Silver-Gold pipeline.

Pipeline Code

from pyspark import pipelines as dp
# Bronze Layer
@dp.table(name="bronze_orders")
def bronze_orders():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/Volumes/raw/orders/")
)
# Silver Layer
@dp.table(name="silver_orders")
def silver_orders():
return (
spark.readStream.table("bronze_orders")
.filter("order_amount > 0")
)
# Gold Layer
@dp.materialized_view(name="daily_sales")
def daily_sales():
return (
spark.table("silver_orders")
.groupBy("order_date")
.sum("order_amount")
)

Data Flow

Raw Files
Bronze (Streaming)
Silver (Streaming)
Gold (Materialized View)

Working with Multiple Streaming Sources

Organizations often receive data from multiple locations or business units.

Example: Combining US and EU Streams

from pyspark import pipelines as dp
dp.create_streaming_table("all_orders")
@dp.append_flow(target="all_orders")
def us_orders():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/data/us/")
)
@dp.append_flow(target="all_orders")
def eu_orders():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/data/eu/")
)

Advantages

  • Consolidates multiple sources
  • Simplifies downstream processing
  • Supports scalable ingestion patterns

How to Run a Lakeflow Declarative Pipeline

Step 1: Create a Notebook

Create a Databricks notebook and add your pipeline code.

from pyspark import pipelines as dp

Step 2: Create a Pipeline

Navigate to:

Workflows
→ Lakeflow Declarative Pipelines
→ Create Pipeline

Step 3: Configure Pipeline Settings

Provide the following information:

  • Notebook Path
  • Catalog
  • Schema
  • Storage Location

Example:

Catalog : demo
Schema : bronze

Step 4: Select an Execution Mode

Triggered Mode

Runs the pipeline once and then stops.

Best suited for:

  • Daily ETL jobs
  • Scheduled reports
  • Batch processing

Continuous Mode

Runs continuously and processes incoming data automatically.

Best suited for:

  • Real-time analytics
  • Event-driven architectures
  • Streaming applications

Streaming Tables vs Materialized Views

FeatureStreaming TableMaterialized View
Read Methodspark.readStreamspark.table
Processing TypeStreamingBatch
Decorator@dp.table@dp.materialized_view
Real-Time UpdatesYesNo
Reporting WorkloadsLimitedExcellent
AggregationsSupportedRecommended

Best Practices for Databricks DLT Pipelines

Use the Medallion Architecture

Separate your data into:

  • Bronze (Raw Data)
  • Silver (Cleaned Data)
  • Gold (Business-Ready Data)

Keep Transformations Modular

Create smaller pipeline functions rather than one large transformation.

Use Materialized Views for Reporting

Aggregations and dashboard datasets perform better when stored as materialized views.

Monitor Pipeline Health

Leverage Databricks pipeline monitoring to identify failures and performance bottlenecks quickly.

Conclusion

Databricks Lakeflow Declarative Pipelines simplify modern data engineering by providing a declarative framework for building both streaming and batch pipelines. Using @dp.table for streaming ingestion and @dp.materialized_view for batch transformations enables teams to develop scalable, maintainable, and production-ready data pipelines with minimal operational effort.

By adopting these patterns, organizations can implement robust Medallion Architectures and accelerate their journey toward real-time analytics and data-driven decision-making.

Start Discussion

This site uses Akismet to reduce spam. Learn how your comment data is processed.