What is UNION in Databricks?

The UNION operation in Databricks (Apache Spark) stacks two or more DataFrames that share the same schema. PySpark’s union() behaves like SQL’s UNION ALL, but it operates on distributed DataFrames rather than tables.

Types of UNION Operations

  1. UNION (Equivalent to UNION ALL in SQL)
    • Stacks two DataFrames but does not remove duplicates.
    • More efficient since it avoids the costly deduplication step.
  2. UNION DISTINCT
    • Removes duplicate records after stacking DataFrames.
    • Equivalent to SQL’s UNION (see the Spark SQL sketch after this list).
    • More resource-intensive since it requires a shuffle operation.
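To make the SQL equivalence above concrete, here is a minimal Spark SQL sketch; the temp view names emp_a and emp_b are made up for illustration, and the point is only to show which DataFrame calls each SQL variant corresponds to:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnionSqlExample").getOrCreate()

# Register two small sample tables as temp views (illustrative data only)
spark.createDataFrame([(1, "Alice", "HR"), (2, "Bob", "IT")],
                      ["id", "name", "department"]).createOrReplaceTempView("emp_a")
spark.createDataFrame([(2, "Bob", "IT"), (3, "Charlie", "Finance")],
                      ["id", "name", "department"]).createOrReplaceTempView("emp_b")

# UNION ALL keeps duplicates -- same behavior as DataFrame union()
spark.sql("SELECT * FROM emp_a UNION ALL SELECT * FROM emp_b").show()

# UNION (i.e. UNION DISTINCT) removes duplicates -- same as union() + distinct()
spark.sql("SELECT * FROM emp_a UNION SELECT * FROM emp_b").show()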

Using UNION in Databricks (PySpark Examples)

Let’s consider two employee DataFrames with the same schema.

1. UNION (Combine Without Deduplication)

from pyspark.sql import SparkSession 
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize Spark session
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True)
])

# Create two DataFrames
df1 = spark.createDataFrame([(1, "Alice", "HR"), (2, "Bob", "IT")], schema)
df2 = spark.createDataFrame([(2, "Bob", "IT"), (3, "Charlie", "Finance")], schema)

# Perform UNION (equivalent to UNION ALL in SQL)
union_df = df1.union(df2)
union_df.show()

Output:

+---+-------+----------+
| id|   name|department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
+---+-------+----------+

Here, the duplicate row (Bob from IT) is retained.

2. UNION DISTINCT (Combine With Deduplication)

union_distinct_df = df1.union(df2).distinct() 
union_distinct_df.show()

Output:

+---+-------+----------+
| id|   name|department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
+---+-------+----------+

This removes duplicate rows, mimicking SQL’s UNION DISTINCT.

Key Considerations When Using UNION in Databricks

1. Data Schema Must Match

  • union() resolves columns by position, so all DataFrames must have the same number of columns, in the same order, with compatible data types; keeping column names identical avoids confusion.
  • If schemas differ, align them before applying UNION (an alternative using unionByName() is sketched after this example):
from pyspark.sql.functions import lit
# Add a placeholder "salary" column (null) to both DataFrames so the schemas
# match; in practice you would only add it to the DataFrame that is missing it
df1 = df1.withColumn("salary", lit(None).cast("int"))
df2 = df2.withColumn("salary", lit(None).cast("int"))
aligned_union_df = df1.union(df2)
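
If the DataFrames have the same columns in a different order, or one side is missing a column entirely, unionByName() can be a safer alternative since it matches columns by name; note that the allowMissingColumns flag used in this sketch requires Spark 3.1 or later:

# Match columns by name rather than position; allowMissingColumns=True
# fills columns that exist on only one side with nulls (Spark 3.1+)
aligned_by_name_df = df1.unionByName(df2, allowMissingColumns=True)
aligned_by_name_df.show()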

2. Performance Considerations

  • Prefer a plain union() and skip deduplication unless it is genuinely needed, since distinct() triggers an expensive shuffle (you can verify this with explain(), as sketched after this list).
  • If deduplication is required, consider dropDuplicates() instead of distinct() for finer control over which columns define a duplicate:
union_df.dropDuplicates(["id"]).show()
  • Write the combined result in an efficient columnar format such as Parquet to improve downstream read performance.
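
To see the cost of deduplication for yourself, comparing the physical plans is a quick check: the deduplicated version should include an extra Exchange (shuffle) stage. A minimal sketch, reusing the df1 and df2 DataFrames from the earlier example:

# Plain union: rows are appended partition by partition, no shuffle required
df1.union(df2).explain()

# union + distinct: the plan should show an Exchange (shuffle) for deduplication
df1.union(df2).distinct().explain()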

When to Use UNION in Databricks

In Databricks, use UNION when you need to combine the results of multiple datasets that have the same number of columns and compatible data types, as in the sketch below.
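
For instance, when more than two datasets need to be combined, a common pattern is to fold union() over a list of DataFrames. A minimal sketch, assuming all DataFrames share the same schema (df3 is a hypothetical third DataFrame added here for illustration):

from functools import reduce
from pyspark.sql import DataFrame

# Hypothetical third DataFrame with the same schema as df1 and df2
df3 = spark.createDataFrame([(4, "Dana", "Marketing")], schema)

# Fold union() over the list so any number of DataFrames can be stacked
all_employees_df = reduce(DataFrame.union, [df1, df2, df3])
all_employees_df.show()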

Conclusion

Databricks provides powerful ways to combine datasets using UNION. By understanding the differences between union() and union().distinct(), you can optimize performance while maintaining data integrity.

Avoid deduplication unless it is actually required, especially on large datasets, and always ensure schema alignment for smooth UNION operations.

Have questions or a unique UNION use case? Drop them in the comments below!