Spark Structured Streaming vs Spark Declarative Pipelines (SDP): What Every Data Engineer Should Know

If you’ve been building data pipelines on Databricks, you’ve likely used Spark Structured Streaming — the powerful low-level API for real-time data processing. But with the introduction of Spark Declarative Pipelines (SDP), formerly known as Delta Live Tables, Databricks has reimagined how pipelines should be built.

So which one should you use? In this post, we break down both approaches — what they are, how they differ, and when to choose one over the other.

What Is Spark Structured Streaming?

Spark Structured Streaming is Apache Spark’s built-in, low-latency stream processing engine. It extends the familiar DataFrame/Dataset API to support continuous, event-time-based data processing from sources like Kafka, cloud storage, and Delta tables.

Key Characteristics of Structured Streaming

Built on the Spark SQL engine with a DataFrame-based API
Supports micro-batch and continuous processing modes
Manages state, watermarks, and checkpoints manually
Provides fine-grained control over triggers, output modes, and sinks
Suitable for custom, complex streaming logic

Basic Structured Streaming Example

python

			
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType
spark = SparkSession.builder.getOrCreate()
schema = StructType() \
    .add("event_id", StringType()) \
    .add("amount", DoubleType()) \
    .add("timestamp", StringType())
# Read from Kafka
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*"))
# Apply transformation
filtered = df.filter(col("amount") > 100)
# Write to Delta
query = (filtered.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/transactions")
    .outputMode("append")
    .table("gold.transactions"))

		

What Is Spark Declarative Pipelines (SDP)?

Spark Declarative Pipelines is Databricks’ higher-level framework for building reliable, production-grade data pipelines — declaratively. You simply define what data transformations you want, and SDP handles how to execute them reliably, efficiently, and at scale.

“Simply declare the data transformations you need — let Spark Declarative Pipelines handle the rest.” — Databricks

Key Characteristics of SDP

Declarative syntax using Python or SQL
Automated dependency management and DAG execution
Built-in data quality enforcement via Expectations
Supports both batch and streaming in a unified pipeline
Auto-manages checkpointing, retries, scaling, and recovery
Integrated with Unity Catalog for governance
Serverless compute with up to 5x better price/performance

Basic SDP Pipeline Example (Python)

Note: The current recommended module is from pyspark import pipelines as dp. The older import dlt is considered legacy and should be avoided in new pipelines.

python

			
from pyspark import pipelines as dp
from pyspark.sql.functions import col
@dp.table(
    name="raw_transactions",
    comment="Raw ingested transactions"
)
def raw_transactions():
    return (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load())
@dp.expect("valid_amount", "amount > 0")
@dp.table(
    name="clean_transactions",
    comment="Validated transactions"
)
def clean_transactions():
    return (dp.read_stream("raw_transactions")
        .filter(col("amount") > 100))
@dp.table(
    name="gold_summary",
    comment="Aggregated daily totals"
)
def gold_summary():
    return (dp.read("clean_transactions")
        .groupBy("date")
        .agg({"amount": "sum"}))

		

SDP with SQL (Even Simpler)

sql

			
-- Bronze layer
CREATE OR REFRESH STREAMING TABLE raw_transactions
AS SELECT * FROM cloud_files("/data/transactions", "json");
-- Silver layer with expectations
CREATE OR REFRESH STREAMING TABLE clean_transactions
  (CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW)
AS SELECT * FROM STREAM(LIVE.raw_transactions)
WHERE amount > 100;
-- Gold layer
CREATE OR REFRESH MATERIALIZED VIEW gold_summary
AS SELECT date, SUM(amount) AS total
FROM LIVE.clean_transactions
GROUP BY date;

		

Head-to-Head Comparison: Structured Streaming vs SDP

Feature	Spark Structured Streaming	Spark Declarative Pipelines (SDP)
Programming model	Imperative (how to do it)	Declarative (what to do)
Language support	Python, Scala, Java, R	Python, SQL
Dependency management	Manual	Automatic DAG
Data quality	Custom code	Built-in Expectations
Checkpointing	Manual setup	Fully automated
Error recovery	Manual retry logic	Seamless auto-recovery
Batch + Streaming	Separate pipelines	Unified in one pipeline
CDC support	Custom implementation	Native AutoCDC API
Observability	Spark UI / custom	Built-in lineage & quality reporting
Governance	Manual	Unity Catalog integrated
Serverless	Not available	Up to 5x price/performance
Best for	Custom, complex logic	Production ETL pipelines

Core Benefits of Spark Declarative Pipelines

Efficient Ingestion

SDP enables efficient ingestion for data engineers, Python developers, data scientists, and SQL analysts. You can load data from any Apache Spark-supported source — whether batch, streaming, or CDC — from cloud storage, message buses, change data feeds, databases, and enterprise apps.

Intelligent Transformation

From just a few lines of code, SDP determines the most efficient way to build and execute your batch or streaming data pipelines — automatically optimizing for cost or performance while minimizing complexity.

Automated Operations

SDP codifies best practices out of the box, automating dependency management, scaling, recovery, data quality rules, and more. Engineers can focus on delivering high-quality data rather than operating and maintaining pipeline infrastructure.

Key Use Cases

Declarative ETL

Use SDP when you need to build batch and real-time ETL pipelines quickly. Declarative programming means you get to harness the power of ETL on Databricks with just a few lines of code.

Change Data Capture (CDC)

SDP simplifies change data capture with the AutoCDC API for change data feeds and database snapshots. It automatically handles out-of-sequence records for SCD Type 1 and 2 — eliminating one of the hardest parts of CDC pipelines.

python

			
from pyspark import pipelines as dp
dp.create_streaming_table("customers")
dp.apply_changes(
    target = "customers",
    source = "raw_customers_cdc",
    keys = ["customer_id"],
    sequence_by = col("updated_at"),
    stored_as_scd_type = 2   # SCD Type 2 handled automatically!
)

		

Streaming Workloads

Build and run your batch and streaming pipelines in one place with controllable and automated refresh settings, reducing operational complexity. Use streaming tables and materialized views for end-to-end incremental processing.

SQL-Based ETL

Data warehouse users get the full power of declarative ETL via an accessible SQL interface. SQL analysts can build infrastructure-free data pipelines without deep Spark knowledge.

When to Choose Structured Streaming vs SDP

Choose Structured Streaming when:

You need fine-grained, custom streaming logic (e.g., complex stateful processing, custom sinks)
You are building on non-Databricks Spark environments (open-source, EMR, HDInsight)
Your team has deep Spark expertise and needs low-level control
You are running Continuous Processing mode with millisecond latency requirements
Your pipeline has non-standard sources or sinks not supported by SDP

Choose Spark Declarative Pipelines (SDP) when:

You want production-ready pipelines with minimal boilerplate
You need built-in data quality, observability, and lineage
Your pipeline follows medallion architecture (Bronze → Silver → Gold)
Your team includes SQL analysts, not just Spark engineers
You want automated recovery, retry, and dependency management
You need native CDC/SCD support
You are building on Databricks and want Unity Catalog governance

More Features in SDP Worth Knowing

Automated Data Quality with Expectations

python

			
from pyspark import pipelines as dp
@dp.expect("non_null_id", "customer_id IS NOT NULL")
@dp.expect_or_drop("positive_amount", "amount > 0")
@dp.expect_or_fail("valid_status", "status IN ('active', 'inactive')")
@dp.table
def validated_orders():
    return dp.read_stream("raw_orders")

		

expect — track violations but keep rows
expect_or_drop — silently drop bad rows
expect_or_fail — fail the pipeline on violations

CI/CD and Version Control

SDP supports isolated development, testing, and production environments — making it easy to apply CI/CD practices to data pipelines without extra tooling.

Pipeline Monitoring and Observability

Built-in features include data lineage, update history, and data quality reporting — all accessible directly from the Databricks workspace UI without any additional setup.

Agent-Ready Pipelines with Genie Code

SDP now integrates with Genie Code, Databricks’ AI assistant, to automate ETL workloads, optimize queries, and build pipelines through natural conversation.

Pricing

SDP uses usage-based pricing — you only pay for what you use at per-second granularity. With serverless compute, Databricks reports up to 5x better price/performance for data ingestion and up to 98% cost savings for complex transformations.

Summary

Both Spark Structured Streaming and Spark Declarative Pipelines are powerful tools — but they serve different needs.

Structured Streaming gives you maximum flexibility and control for custom, complex streaming use cases, especially outside of Databricks.

Spark Declarative Pipelines is the right choice when you want production-ready, maintainable, observable pipelines on Databricks — without wrestling with infrastructure, checkpoints, or boilerplate code.

For most modern data engineering teams building on Databricks, SDP is the recommended path for new pipeline development. It codifies best practices, accelerates delivery, and reduces operational burden — so your team can focus on what matters most: delivering high-quality data to the business.

Latest Posts

Databricks RSA Preparation: Top Interview Questions and Answers (2026 Guide)

July 12, 2026
How to Create a Subagent in Claude Code: A Step-by-Step Guide

July 7, 2026
Migrating from Hive Metastore to Unity Catalog in Databricks: Complete Step-by-Step Guide

June 14, 2026
2-Minute Random Topics to Improve English Speaking Skills

June 14, 2026

What Is Spark Structured Streaming?

Key Characteristics of Structured Streaming

Basic Structured Streaming Example

What Is Spark Declarative Pipelines (SDP)?

Key Characteristics of SDP

Basic SDP Pipeline Example (Python)

SDP with SQL (Even Simpler)

Head-to-Head Comparison: Structured Streaming vs SDP

Core Benefits of Spark Declarative Pipelines

Efficient Ingestion

Intelligent Transformation

Automated Operations

Key Use Cases

Declarative ETL

Change Data Capture (CDC)

Streaming Workloads

SQL-Based ETL

When to Choose Structured Streaming vs SDP

Choose Structured Streaming when:

Choose Spark Declarative Pipelines (SDP) when:

More Features in SDP Worth Knowing

Automated Data Quality with Expectations

CI/CD and Version Control

Pipeline Monitoring and Observability

Agent-Ready Pipelines with Genie Code

Pricing

Summary

Further Reading

Like this:

Fediverse reactions

Latest Posts

Databricks RSA Preparation: Top Interview Questions and Answers (2026 Guide)

How to Create a Subagent in Claude Code: A Step-by-Step Guide

Migrating from Hive Metastore to Unity Catalog in Databricks: Complete Step-by-Step Guide

2-Minute Random Topics to Improve English Speaking Skills

Subscribe for DAILY TIPS

Spark Structured Streaming vs Spark Declarative Pipelines (SDP): What Every Data Engineer Should Know

What Is Spark Structured Streaming?

Key Characteristics of Structured Streaming

Basic Structured Streaming Example

What Is Spark Declarative Pipelines (SDP)?

Key Characteristics of SDP

Basic SDP Pipeline Example (Python)

SDP with SQL (Even Simpler)

Head-to-Head Comparison: Structured Streaming vs SDP

Core Benefits of Spark Declarative Pipelines

Efficient Ingestion

Intelligent Transformation

Automated Operations

Key Use Cases

Declarative ETL

Change Data Capture (CDC)

Streaming Workloads

SQL-Based ETL

When to Choose Structured Streaming vs SDP

Choose Structured Streaming when:

Choose Spark Declarative Pipelines (SDP) when:

More Features in SDP Worth Knowing

Automated Data Quality with Expectations

CI/CD and Version Control

Pipeline Monitoring and Observability

Agent-Ready Pipelines with Genie Code

Pricing

Summary

Further Reading

Share this:

Like this:

Fediverse reactions

Latest Posts

Databricks RSA Preparation: Top Interview Questions and Answers (2026 Guide)

How to Create a Subagent in Claude Code: A Step-by-Step Guide

Migrating from Hive Metastore to Unity Catalog in Databricks: Complete Step-by-Step Guide

2-Minute Random Topics to Improve English Speaking Skills

Discover more from Srinimf