If you’ve been building data pipelines on Databricks, you’ve likely used Spark Structured Streaming — the powerful low-level API for real-time data processing. But with the introduction of Spark Declarative Pipelines (SDP), formerly known as Delta Live Tables, Databricks has reimagined how pipelines should be built.

So which one should you use? In this post, we break down both approaches — what they are, how they differ, and when to choose one over the other.


What Is Spark Structured Streaming?

Spark Structured Streaming is Apache Spark’s built-in, low-latency stream processing engine. It extends the familiar DataFrame/Dataset API to support continuous, event-time-based data processing from sources like Kafka, cloud storage, and Delta tables.

Key Characteristics of Structured Streaming

  • Built on the Spark SQL engine with a DataFrame-based API
  • Supports micro-batch and continuous processing modes
  • Manages state, watermarks, and checkpoints manually
  • Provides fine-grained control over triggers, output modes, and sinks
  • Suitable for custom, complex streaming logic

Basic Structured Streaming Example

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType
spark = SparkSession.builder.getOrCreate()
schema = StructType() \
.add("event_id", StringType()) \
.add("amount", DoubleType()) \
.add("timestamp", StringType())
# Read from Kafka
df = (spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "transactions")
.load()
.select(from_json(col("value").cast("string"), schema).alias("data"))
.select("data.*"))
# Apply transformation
filtered = df.filter(col("amount") > 100)
# Write to Delta
query = (filtered.writeStream
.format("delta")
.option("checkpointLocation", "/checkpoints/transactions")
.outputMode("append")
.table("gold.transactions"))

What Is Spark Declarative Pipelines (SDP)?

Spark Declarative Pipelines is Databricks’ higher-level framework for building reliable, production-grade data pipelines — declaratively. You simply define what data transformations you want, and SDP handles how to execute them reliably, efficiently, and at scale.

“Simply declare the data transformations you need — let Spark Declarative Pipelines handle the rest.” — Databricks

Key Characteristics of SDP

  • Declarative syntax using Python or SQL
  • Automated dependency management and DAG execution
  • Built-in data quality enforcement via Expectations
  • Supports both batch and streaming in a unified pipeline
  • Auto-manages checkpointing, retries, scaling, and recovery
  • Integrated with Unity Catalog for governance
  • Serverless compute with up to 5x better price/performance

Basic SDP Pipeline Example (Python)

Note: The current recommended module is from pyspark import pipelines as dp. The older import dlt is considered legacy and should be avoided in new pipelines.

python

from pyspark import pipelines as dp
from pyspark.sql.functions import col
@dp.table(
name="raw_transactions",
comment="Raw ingested transactions"
)
def raw_transactions():
return (spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "transactions")
.load())
@dp.expect("valid_amount", "amount > 0")
@dp.table(
name="clean_transactions",
comment="Validated transactions"
)
def clean_transactions():
return (dp.read_stream("raw_transactions")
.filter(col("amount") > 100))
@dp.table(
name="gold_summary",
comment="Aggregated daily totals"
)
def gold_summary():
return (dp.read("clean_transactions")
.groupBy("date")
.agg({"amount": "sum"}))

SDP with SQL (Even Simpler)

sql

-- Bronze layer
CREATE OR REFRESH STREAMING TABLE raw_transactions
AS SELECT * FROM cloud_files("/data/transactions", "json");
-- Silver layer with expectations
CREATE OR REFRESH STREAMING TABLE clean_transactions
(CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW)
AS SELECT * FROM STREAM(LIVE.raw_transactions)
WHERE amount > 100;
-- Gold layer
CREATE OR REFRESH MATERIALIZED VIEW gold_summary
AS SELECT date, SUM(amount) AS total
FROM LIVE.clean_transactions
GROUP BY date;

Head-to-Head Comparison: Structured Streaming vs SDP

FeatureSpark Structured StreamingSpark Declarative Pipelines (SDP)
Programming modelImperative (how to do it)Declarative (what to do)
Language supportPython, Scala, Java, RPython, SQL
Dependency managementManualAutomatic DAG
Data qualityCustom codeBuilt-in Expectations
CheckpointingManual setupFully automated
Error recoveryManual retry logicSeamless auto-recovery
Batch + StreamingSeparate pipelinesUnified in one pipeline
CDC supportCustom implementationNative AutoCDC API
ObservabilitySpark UI / customBuilt-in lineage & quality reporting
GovernanceManualUnity Catalog integrated
ServerlessNot availableUp to 5x price/performance
Best forCustom, complex logicProduction ETL pipelines

Core Benefits of Spark Declarative Pipelines

Efficient Ingestion

SDP enables efficient ingestion for data engineers, Python developers, data scientists, and SQL analysts. You can load data from any Apache Spark-supported source — whether batch, streaming, or CDC — from cloud storage, message buses, change data feeds, databases, and enterprise apps.

Intelligent Transformation

From just a few lines of code, SDP determines the most efficient way to build and execute your batch or streaming data pipelines — automatically optimizing for cost or performance while minimizing complexity.

Automated Operations

SDP codifies best practices out of the box, automating dependency management, scaling, recovery, data quality rules, and more. Engineers can focus on delivering high-quality data rather than operating and maintaining pipeline infrastructure.


Key Use Cases

Declarative ETL

Use SDP when you need to build batch and real-time ETL pipelines quickly. Declarative programming means you get to harness the power of ETL on Databricks with just a few lines of code.

Change Data Capture (CDC)

SDP simplifies change data capture with the AutoCDC API for change data feeds and database snapshots. It automatically handles out-of-sequence records for SCD Type 1 and 2 — eliminating one of the hardest parts of CDC pipelines.

python

from pyspark import pipelines as dp
dp.create_streaming_table("customers")
dp.apply_changes(
target = "customers",
source = "raw_customers_cdc",
keys = ["customer_id"],
sequence_by = col("updated_at"),
stored_as_scd_type = 2 # SCD Type 2 handled automatically!
)

Streaming Workloads

Build and run your batch and streaming pipelines in one place with controllable and automated refresh settings, reducing operational complexity. Use streaming tables and materialized views for end-to-end incremental processing.

SQL-Based ETL

Data warehouse users get the full power of declarative ETL via an accessible SQL interface. SQL analysts can build infrastructure-free data pipelines without deep Spark knowledge.


When to Choose Structured Streaming vs SDP

Choose Structured Streaming when:

  • You need fine-grained, custom streaming logic (e.g., complex stateful processing, custom sinks)
  • You are building on non-Databricks Spark environments (open-source, EMR, HDInsight)
  • Your team has deep Spark expertise and needs low-level control
  • You are running Continuous Processing mode with millisecond latency requirements
  • Your pipeline has non-standard sources or sinks not supported by SDP

Choose Spark Declarative Pipelines (SDP) when:

  • You want production-ready pipelines with minimal boilerplate
  • You need built-in data quality, observability, and lineage
  • Your pipeline follows medallion architecture (Bronze → Silver → Gold)
  • Your team includes SQL analysts, not just Spark engineers
  • You want automated recovery, retry, and dependency management
  • You need native CDC/SCD support
  • You are building on Databricks and want Unity Catalog governance

More Features in SDP Worth Knowing

Automated Data Quality with Expectations

python

from pyspark import pipelines as dp
@dp.expect("non_null_id", "customer_id IS NOT NULL")
@dp.expect_or_drop("positive_amount", "amount > 0")
@dp.expect_or_fail("valid_status", "status IN ('active', 'inactive')")
@dp.table
def validated_orders():
return dp.read_stream("raw_orders")
  • expect — track violations but keep rows
  • expect_or_drop — silently drop bad rows
  • expect_or_fail — fail the pipeline on violations

CI/CD and Version Control

SDP supports isolated development, testing, and production environments — making it easy to apply CI/CD practices to data pipelines without extra tooling.

Pipeline Monitoring and Observability

Built-in features include data lineage, update history, and data quality reporting — all accessible directly from the Databricks workspace UI without any additional setup.

Agent-Ready Pipelines with Genie Code

SDP now integrates with Genie Code, Databricks’ AI assistant, to automate ETL workloads, optimize queries, and build pipelines through natural conversation.


Pricing

SDP uses usage-based pricing — you only pay for what you use at per-second granularity. With serverless compute, Databricks reports up to 5x better price/performance for data ingestion and up to 98% cost savings for complex transformations.


Summary

Both Spark Structured Streaming and Spark Declarative Pipelines are powerful tools — but they serve different needs.

Structured Streaming gives you maximum flexibility and control for custom, complex streaming use cases, especially outside of Databricks.

Spark Declarative Pipelines is the right choice when you want production-ready, maintainable, observable pipelines on Databricks — without wrestling with infrastructure, checkpoints, or boilerplate code.

For most modern data engineering teams building on Databricks, SDP is the recommended path for new pipeline development. It codifies best practices, accelerates delivery, and reduces operational burden — so your team can focus on what matters most: delivering high-quality data to the business.


Further Reading

Fediverse reactions

Leave a Reply

Discover more from Srinimf

Subscribe now to keep reading and get access to the full archive.

Continue reading