When working with large datasets in Databricks, you often need to merge data from different sources. The UNION operation lets you stack datasets consistently. In this post, we’ll look at UNION in Databricks, its variations, and tips to improve performance.
What is UNION in Databricks?
The UNION operation in Databricks (Apache Spark) combines two DataFrames with the same structure into one. It behaves like SQL’s UNION ALL but is designed for distributed datasets in PySpark.
Types of UNION Operations
- UNION (equivalent to UNION ALL in SQL)
  - Stacks two DataFrames but does not remove duplicates.
  - More efficient, since it avoids the costly deduplication step.
- UNION DISTINCT
  - Removes duplicate records after stacking DataFrames.
  - Equivalent to SQL’s UNION.
  - More resource-intensive, since it requires a shuffle operation.
Using UNION in Databricks (PySpark Examples)
Let’s consider two employee DataFrames with the same schema.
1. UNION (Combine Without Deduplication)
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Initialize Spark session
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True)
])
# Create two DataFrames
df1 = spark.createDataFrame([(1, "Alice", "HR"), (2, "Bob", "IT")], schema)
df2 = spark.createDataFrame([(2, "Bob", "IT"), (3, "Charlie", "Finance")], schema)
# Perform UNION (equivalent to UNION ALL in SQL)
union_df = df1.union(df2)
union_df.show()
Output:
+---+-------+----------+
| id|   name|department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
+---+-------+----------+
Here, the duplicate row (Bob from IT) is retained.
2. UNION DISTINCT (Combine With Deduplication)
union_distinct_df = df1.union(df2).distinct()
union_distinct_df.show()
Output:
+---+-------+----------+
| id|   name|department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
+---+-------+----------+
This removes duplicate rows, mimicking SQL’s UNION DISTINCT.
Key Considerations When Using UNION in Databricks
1. Data Schema Must Match
- Column names, order, and data types must be identical in all DataFrames.
- If schemas differ, align them before applying UNION:
from pyspark.sql.functions import lit
df1 = df1.withColumn("salary", lit(None).cast("int"))
df2 = df2.withColumn("salary", lit(None).cast("int"))
aligned_union_df = df1.union(df2)
2. Performance Considerations
- Prefer union() over union().distinct() unless deduplication is truly needed, since distinct() triggers an expensive shuffle.
- If deduplication is required, consider dropDuplicates() instead of distinct() for finer control over which columns define a duplicate:
union_df.dropDuplicates(["id"]).show()
- Optimize storage format (e.g., Parquet) to improve performance.
When to Use UNION in Databricks
In Databricks, use UNION when you need to combine the results of multiple datasets that share the same number of columns and compatible data types.
Conclusion
Databricks provides powerful ways to combine datasets using UNION. By understanding the differences between union() and union().distinct(), you can optimize performance while maintaining data integrity.
Avoid unnecessary deduplication unless required for large datasets, and always ensure schema alignment for smooth UNION operations.
Have questions or a unique UNION use case? Drop them in the comments below!