In the world of big data, merging datasets efficiently is a crucial task. PySpark, a powerful distributed computing framework, provides robust SQL functionality, including the MERGE operation, a convenient way to upsert (update or insert) data into Delta tables. This blog post explores the mechanics, use cases, and practical examples of using the MERGE statement in PySpark SQL.
What is a Merge Query?
The MERGE statement is an SQL construct that conditionally updates, deletes, or inserts rows in a target table based on the contents of a source table. This operation is handy in scenarios like:
- Slowly Changing Dimensions (SCD): Maintaining historical data changes.
- Upserting Data: Combining updates and inserts in a single operation.
- Data Synchronization: Aligning changes between different datasets.
In PySpark, the MERGE statement is available for Delta tables. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark, and a Delta table is simply a table stored in that format.
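To make that concrete, any DataFrame can be persisted as a Delta table. The snippet below is a minimal sketch; the events table name and sample rows are invented for illustration, and it assumes a session already configured for Delta (see Prerequisites below).

from pyspark.sql import SparkSession

# Assumes Delta support is already configured on the session (see Prerequisites).
spark = SparkSession.builder.getOrCreate()

# Persist an ordinary DataFrame in Delta format; "events" is a placeholder name.
df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
df.write.format("delta").saveAsTable("events")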
Prerequisites
Before using the MERGE statement, ensure the following:
- Delta Lake Installed: PySpark with Delta Lake capabilities is required.
- Delta Table: The target table must be a Delta table.
- Spark Session: A Spark session with Delta support must be active (a minimal setup is sketched below).
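As a quick reference for that last point, here is one minimal way to start a Delta-enabled session. It assumes the delta-spark PyPI package is installed; the merge-demo app name is arbitrary.

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Builder for a Delta-enabled session; "merge-demo" is an arbitrary app name.
builder = (
    SparkSession.builder.appName("merge-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip makes the Delta Lake jars available to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()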
Syntax of PySpark SQL Merge Query
The basic syntax for a MERGE query in PySpark SQL is:
MERGE INTO target_table AS target
USING source_table AS source
ON merge_condition
WHEN MATCHED THEN
  UPDATE SET target.column1 = source.column1,
             target.column2 = source.column2
WHEN NOT MATCHED THEN
  INSERT (column1, column2, column3)
  VALUES (source.column1, source.column2, source.column3);
Delta Lake also accepts a WHEN MATCHED ... THEN DELETE clause, which covers the delete case mentioned earlier, and each WHEN branch may carry an additional AND condition to narrow when it fires.
Example: Using PySpark SQL Merge Query
Let’s walk through a practical example.
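Below is a minimal, self-contained sketch of such a walkthrough. All names and rows here (the customers Delta table, the customer_updates view, and the sample people) are illustrative assumptions: we seed a small target table, register incoming changes as a temporary view, and let MERGE update the matched row and insert the new one.

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Delta-enabled session, as shown under Prerequisites.
builder = (
    SparkSession.builder.appName("merge-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Target: a small Delta table of customers.
spark.sql("CREATE TABLE IF NOT EXISTS customers (id INT, name STRING, city STRING) USING DELTA")
spark.sql("INSERT INTO customers VALUES (1, 'Alice', 'London'), (2, 'Bob', 'Paris')")

# Source: incoming changes, exposed to SQL as a temporary view.
updates = spark.createDataFrame(
    [(2, "Bob", "Berlin"),    # existing id -> will be updated
     (3, "Cara", "Madrid")],  # new id      -> will be inserted
    ["id", "name", "city"],
)
updates.createOrReplaceTempView("customer_updates")

# Upsert: update matched rows, insert unmatched ones.
spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN
      UPDATE SET target.name = source.name, target.city = source.city
    WHEN NOT MATCHED THEN
      INSERT (id, name, city) VALUES (source.id, source.name, source.city)
""")

spark.table("customers").orderBy("id").show()
# Bob's city is now Berlin; Cara has been inserted as a new row.

Note that re-running just the MERGE is idempotent in this sketch: on a second pass every source row matches and simply rewrites the same values.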