If you’ve been building data pipelines on Databricks, you’ve likely used Spark Structured Streaming — the powerful low-level API for real-time data processing. But with the introduction of Spark Declarative Pipelines (SDP), formerly known as Delta Live Tables, Databricks has reimagined how pipelines should be built.
So which one should you use? In this post, we break down both approaches — what they are, how they differ, and when to choose one over the other.
What Is Spark Structured Streaming?
Spark Structured Streaming is Apache Spark’s built-in, low-latency stream processing engine. It extends the familiar DataFrame/Dataset API to support continuous, event-time-based data processing from sources like Kafka, cloud storage, and Delta tables.
Key Characteristics of Structured Streaming
- Built on the Spark SQL engine with a DataFrame-based API
- Supports micro-batch and continuous processing modes
- Leaves state, watermarks, and checkpoint locations for you to configure explicitly
- Provides fine-grained control over triggers, output modes, and sinks
- Suitable for custom, complex streaming logic
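To make the watermark idea concrete, here is a minimal pure-Python sketch rather than Spark code. It assumes a simplified model in which the watermark for a micro-batch is the largest event time seen in earlier batches minus an allowed delay, and anything older is treated as too late. The function name `apply_watermark` and the sample events are hypothetical.

```python
from datetime import datetime, timedelta

def apply_watermark(batches, delay):
    """Simplified model of event-time watermarking across micro-batches.

    The watermark applied to a batch is (max event time seen in earlier
    batches) - delay. Events older than the watermark are treated as late.
    """
    max_event_time = None
    accepted, dropped = [], []
    for batch in batches:
        # The watermark is fixed for the whole batch, based on earlier batches only
        watermark = None if max_event_time is None else max_event_time - delay
        for event_id, event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_id)
            else:
                accepted.append(event_id)
        # Advance the maximum event time after the batch has been processed
        if batch:
            batch_max = max(event_time for _, event_time in batch)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
    return accepted, dropped

t0 = datetime(2024, 1, 1, 12, 0)
batches = [
    [("a", t0), ("b", t0 + timedelta(minutes=10))],
    [("c", t0 + timedelta(minutes=8)), ("d", t0 + timedelta(minutes=2))],
]
accepted, dropped = apply_watermark(batches, delay=timedelta(minutes=5))
print(accepted, dropped)
```

In Structured Streaming itself you would declare this with `withWatermark` on the event-time column, and the engine applies it to stateful operations such as windowed aggregations. The sketch only illustrates the bookkeeping you are signing up to reason about.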
Basic Structured Streaming Example
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Expected shape of the JSON payload carried in the Kafka message value
schema = (StructType()
    .add("event_id", StringType())
    .add("amount", DoubleType())
    .add("timestamp", StringType()))

# Read from Kafka and parse the binary value column as JSON
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*"))

# Apply transformation
filtered = df.filter(col("amount") > 100)

# Write to Delta; toTable() starts the query and returns a StreamingQuery
query = (filtered.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/transactions")
    .outputMode("append")
    .toTable("gold.transactions"))
```
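The `checkpointLocation` option in the example above is what lets a restarted query pick up where it left off. The following is a rough pure-Python model of that idea, not Spark's actual checkpoint format: it assumes a list standing in for the source and a small JSON file standing in for the checkpoint. The names `run_once` and `offsets.json` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

def run_once(source, checkpoint_file, sink, max_records):
    """Process up to max_records, resuming after the last committed offset."""
    if checkpoint_file.exists():
        offset = json.loads(checkpoint_file.read_text())["offset"]
    else:
        offset = 0  # first run: start from the beginning of the source
    batch = source[offset:offset + max_records]
    sink.extend(batch)
    # Record progress only after the batch has been written to the sink
    checkpoint_file.write_text(json.dumps({"offset": offset + len(batch)}))
    return len(batch)

source = [f"event-{i}" for i in range(5)]
sink = []
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "offsets.json"
    first = run_once(source, checkpoint, sink, max_records=3)   # initial run
    second = run_once(source, checkpoint, sink, max_records=3)  # simulated restart
print(first, second, sink)
```

With Structured Streaming you choose and protect that checkpoint path yourself. Deleting it or sharing it between queries changes what a restart will reprocess.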
What Is Spark Declarative Pipelines (SDP)?
Spark Declarative Pipelines is Databricks’ higher-level framework for building reliable, production-grade data pipelines — declaratively. You simply define what data transformations you want, and SDP handles how to execute them reliably, efficiently, and at scale.
“Simply declare the data transformations you need — let Spark Declarative Pipelines handle the rest.” — Databricks
Key Characteristics of SDP
- Declarative syntax using Python or SQL
- Automated dependency management and DAG execution
- Built-in data quality enforcement via Expectations
- Supports both batch and streaming in a unified pipeline
- Auto-manages checkpointing, retries, scaling, and recovery
- Integrated with Unity Catalog for governance
- Serverless compute with up to 5x better price/performance
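As a rough illustration of what automated dependency management means, the sketch below derives an execution order from declared dependencies using Python's standard `graphlib` module. It is a conceptual model, not SDP's planner, and the dataset names (including `fraud_alerts`) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical datasets mapped to the datasets each one reads from
dependencies = {
    "raw_transactions": set(),
    "clean_transactions": {"raw_transactions"},
    "gold_summary": {"clean_transactions"},
    "fraud_alerts": {"clean_transactions"},
}

def execution_order(deps):
    """Return an order in which every dataset comes after its inputs."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

In a declarative pipeline you never write this mapping by hand. The framework infers it from which tables each definition reads, which is why reordering functions in the source file does not change the result.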
Basic SDP Pipeline Example (Python)
Note: The current recommended module is `from pyspark import pipelines as dp`. The older `import dlt` is considered legacy and should be avoided in new pipelines.
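```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col, from_json, to_date

# JSON schema of the Kafka message value, written as a DDL string
schema = "event_id STRING, amount DOUBLE, timestamp STRING"

@dp.table(
    name="raw_transactions",
    comment="Raw ingested transactions"
)
def raw_transactions():
    # `spark` is provided by the pipeline runtime
    return (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
        # Parse the Kafka value so downstream tables can use amount and timestamp
        .select(from_json(col("value").cast("string"), schema).alias("data"))
        .select("data.*"))

@dp.table(
    name="clean_transactions",
    comment="Validated transactions"
)
@dp.expect("valid_amount", "amount > 0")
def clean_transactions():
    return (spark.readStream.table("raw_transactions")
        .filter(col("amount") > 100))

# A batch aggregate is declared as a materialized view
@dp.materialized_view(
    name="gold_summary",
    comment="Aggregated daily totals"
)
def gold_summary():
    return (spark.read.table("clean_transactions")
        .groupBy(to_date(col("timestamp")).alias("date"))
        .agg({"amount": "sum"}))
```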
SDP with SQL (Even Simpler)
```sql
-- Bronze layer
CREATE OR REFRESH STREAMING TABLE raw_transactions
AS SELECT * FROM STREAM read_files('/data/transactions', format => 'json');

-- Silver layer with expectations
CREATE OR REFRESH STREAMING TABLE clean_transactions (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(raw_transactions)
WHERE amount > 100;

-- Gold layer
CREATE OR REFRESH MATERIALIZED VIEW gold_summary
AS SELECT date, SUM(amount) AS total
FROM clean_transactions
GROUP BY date;
```
Head-to-Head Comparison: Structured Streaming vs SDP
| Feature | Spark Structured Streaming | Spark Declarative Pipelines (SDP) |
|---|---|---|
| Programming model | Imperative (how to do it) | Declarative (what to do) |
| Language support | Python, Scala, Java, R | Python, SQL |
| Dependency management | Manual | Automatic DAG |
| Data quality | Custom code | Built-in Expectations |
| Checkpointing | Manual setup | Fully automated |
| Error recovery | Manual retry logic | Seamless auto-recovery |
| Batch + Streaming | Separate pipelines | Unified in one pipeline |
| CDC support | Custom implementation | Native AutoCDC API |
| Observability | Spark UI / custom | Built-in lineage & quality reporting |
| Governance | Manual | Unity Catalog integrated |
| Serverless | Not available | Up to 5x price/performance |
| Best for | Custom, complex logic | Production ETL pipelines |
Core Benefits of Spark Declarative Pipelines
Efficient Ingestion
SDP enables efficient ingestion for data engineers, Python developers, data scientists, and SQL analysts. You can load data from any Apache Spark-supported source — whether batch, streaming, or CDC — from cloud storage, message buses, change data feeds, databases, and enterprise apps.
Intelligent Transformation
From just a few lines of code, SDP determines the most efficient way to build and execute your batch or streaming data pipelines — automatically optimizing for cost or performance while minimizing complexity.
Automated Operations
SDP codifies best practices out of the box, automating dependency management, scaling, recovery, data quality rules, and more. Engineers can focus on delivering high-quality data rather than operating and maintaining pipeline infrastructure.
Key Use Cases
Declarative ETL
Use SDP when you need to build batch and real-time ETL pipelines quickly. Declarative programming means you get to harness the power of ETL on Databricks with just a few lines of code.
Change Data Capture (CDC)
SDP simplifies change data capture with the AutoCDC API for change data feeds and database snapshots. It automatically handles out-of-sequence records for SCD Type 1 and 2 — eliminating one of the hardest parts of CDC pipelines.
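```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Target streaming table that receives the change records
dp.create_streaming_table("customers")

# create_auto_cdc_flow is the AutoCDC replacement for the legacy apply_changes
dp.create_auto_cdc_flow(
    target="customers",
    source="raw_customers_cdc",
    keys=["customer_id"],
    sequence_by=col("updated_at"),
    stored_as_scd_type=2,  # keep full history as SCD Type 2
)
```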
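To see why `sequence_by` matters, here is a small pure-Python sketch of SCD Type 2 history built from change records that arrive out of order. It is a batch-style simplification under assumed record shapes, not how SDP processes changes incrementally. The function `build_scd2` and the `start`/`end` field names are hypothetical.

```python
def build_scd2(changes):
    """Turn (key, value, seq) change records into SCD Type 2 history rows.

    Records are ordered by sequence value per key, so arrival order does
    not matter. A row stays open (end=None) until a later version closes it.
    """
    by_key = {}
    for key, value, seq in changes:
        by_key.setdefault(key, []).append((seq, value))

    history = []
    for key in sorted(by_key):
        versions = sorted(by_key[key])  # order by sequence value
        for i, (seq, value) in enumerate(versions):
            # A version ends where the next version for the same key starts
            end = versions[i + 1][0] if i + 1 < len(versions) else None
            history.append({"key": key, "value": value, "start": seq, "end": end})
    return history

changes = [
    ("c1", "Berlin", 3),  # arrives first but is the newest version
    ("c1", "Paris", 1),
    ("c1", "Rome", 2),
]
history = build_scd2(changes)
for row in history:
    print(row)
```

Without a sequencing column, the late-arriving `Paris` record would overwrite `Berlin` as the current value. Ordering by sequence keeps `Berlin` as the open row.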
Streaming Workloads
Build and run your batch and streaming pipelines in one place with controllable and automated refresh settings, reducing operational complexity. Use streaming tables and materialized views for end-to-end incremental processing.
SQL-Based ETL
Data warehouse users get the full power of declarative ETL via an accessible SQL interface. SQL analysts can build infrastructure-free data pipelines without deep Spark knowledge.
When to Choose Structured Streaming vs SDP
Choose Structured Streaming when:
- You need fine-grained, custom streaming logic (e.g., complex stateful processing, custom sinks)
- You are building on non-Databricks Spark environments (open-source, EMR, HDInsight)
- Your team has deep Spark expertise and needs low-level control
- You are running Continuous Processing mode with millisecond latency requirements
- Your pipeline has non-standard sources or sinks not supported by SDP
Choose Spark Declarative Pipelines (SDP) when:
- You want production-ready pipelines with minimal boilerplate
- You need built-in data quality, observability, and lineage
- Your pipeline follows medallion architecture (Bronze → Silver → Gold)
- Your team includes SQL analysts, not just Spark engineers
- You want automated recovery, retry, and dependency management
- You need native CDC/SCD support
- You are building on Databricks and want Unity Catalog governance
More Features in SDP Worth Knowing
Automated Data Quality with Expectations
```python
from pyspark import pipelines as dp

@dp.table
@dp.expect("non_null_id", "customer_id IS NOT NULL")
@dp.expect_or_drop("positive_amount", "amount > 0")
@dp.expect_or_fail("valid_status", "status IN ('active', 'inactive')")
def validated_orders():
    return spark.readStream.table("raw_orders")
```
- `expect`: track violations but keep the rows
- `expect_or_drop`: drop rows that violate the rule
- `expect_or_fail`: fail the pipeline on violations
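The three expectation behaviors can be modeled in a few lines of plain Python. This is only a sketch of the semantics described above, applied to a list of dictionaries. The helper `apply_expectation`, the `ExpectationFailed` exception, and the action labels are hypothetical and are not part of the SDP API.

```python
class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation sees at least one violating row."""

def apply_expectation(rows, predicate, action="warn"):
    """Apply one rule to rows and return (kept_rows, violation_count)."""
    violations = [row for row in rows if not predicate(row)]
    if action == "fail" and violations:
        raise ExpectationFailed(f"{len(violations)} row(s) violated the rule")
    if action == "drop":
        kept = [row for row in rows if predicate(row)]
    else:
        kept = list(rows)  # 'warn' keeps every row and only counts violations
    return kept, len(violations)

orders = [
    {"customer_id": 1, "amount": 50.0},
    {"customer_id": 2, "amount": -5.0},
]

def is_positive(row):
    return row["amount"] > 0

warned, warn_count = apply_expectation(orders, is_positive, action="warn")
dropped, drop_count = apply_expectation(orders, is_positive, action="drop")
print(len(warned), warn_count, len(dropped), drop_count)
```

Calling the helper with `action="fail"` on the same data raises `ExpectationFailed`, which mirrors a pipeline update stopping on invalid input.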
CI/CD and Version Control
SDP supports isolated development, testing, and production environments — making it easy to apply CI/CD practices to data pipelines without extra tooling.
Pipeline Monitoring and Observability
Built-in features include data lineage, update history, and data quality reporting — all accessible directly from the Databricks workspace UI without any additional setup.
Agent-Ready Pipelines with Genie Code
SDP now integrates with Genie Code, Databricks’ AI assistant, to automate ETL workloads, optimize queries, and build pipelines through natural conversation.
Pricing
SDP uses usage-based pricing — you only pay for what you use at per-second granularity. With serverless compute, Databricks reports up to 5x better price/performance for data ingestion and up to 98% cost savings for complex transformations.
Summary
Both Spark Structured Streaming and Spark Declarative Pipelines are powerful tools — but they serve different needs.
Structured Streaming gives you maximum flexibility and control for custom, complex streaming use cases, especially outside of Databricks.
Spark Declarative Pipelines is the right choice when you want production-ready, maintainable, observable pipelines on Databricks — without wrestling with infrastructure, checkpoints, or boilerplate code.
For most modern data engineering teams building on Databricks, SDP is the recommended path for new pipeline development. It codifies best practices, accelerates delivery, and reduces operational burden — so your team can focus on what matters most: delivering high-quality data to the business.
Further Reading
- Official Spark Declarative Pipelines Documentation
- Databricks: Announcing Spark Declarative Pipelines
- Databricks Lakeflow is Generally Available
- Big Book of Data Engineering