What is UNION in Databricks?

The UNION operation in Databricks (Apache Spark) stacks two or more DataFrames that share the same schema. PySpark’s union() behaves like SQL’s UNION ALL, but it operates on distributed DataFrames rather than tables.

Types of UNION Operations

  1. UNION (Equivalent to UNION ALL in SQL)
    • Stacks two DataFrames but does not remove duplicates.
    • More efficient since it avoids the costly deduplication step.
  2. UNION DISTINCT
    • Removes duplicate records after stacking DataFrames.
    • Equivalent to SQL’s UNION (see the Spark SQL sketch after this list).
    • More resource-intensive since it requires a shuffle operation.
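To make the SQL equivalence above concrete, here is a minimal Spark SQL sketch; the temp view names emp_a and emp_b are made up for illustration, and the point is only to show which DataFrame calls each SQL variant corresponds to:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnionSqlExample").getOrCreate()

# Register two small sample tables as temp views (illustrative data only)
spark.createDataFrame([(1, "Alice", "HR"), (2, "Bob", "IT")],
                      ["id", "name", "department"]).createOrReplaceTempView("emp_a")
spark.createDataFrame([(2, "Bob", "IT"), (3, "Charlie", "Finance")],
                      ["id", "name", "department"]).createOrReplaceTempView("emp_b")

# UNION ALL keeps duplicates -- same behavior as DataFrame union()
spark.sql("SELECT * FROM emp_a UNION ALL SELECT * FROM emp_b").show()

# UNION (i.e. UNION DISTINCT) removes duplicates -- same as union() + distinct()
spark.sql("SELECT * FROM emp_a UNION SELECT * FROM emp_b").show()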

Using UNION in Databricks (PySpark Examples)

Let’s consider two employee DataFrames with the same schema.

1. UNION (Combine Without Deduplication)

from pyspark.sql import SparkSession 
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize Spark session
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True)
])

# Create two DataFrames
df1 = spark.createDataFrame([(1, "Alice", "HR"), (2, "Bob", "IT")], schema)
df2 = spark.createDataFrame([(2, "Bob", "IT"), (3, "Charlie", "Finance")], schema)

# Perform UNION (equivalent to UNION ALL in SQL)
union_df = df1.union(df2)
union_df.show()

Output:

+---+-------+----------+
| id|   name|department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
+---+-------+----------+

Here, the duplicate row (Bob from IT) is retained.

2. UNION DISTINCT (Combine With Deduplication)

union_distinct_df = df1.union(df2).distinct() 
union_distinct_df.show()

Output:

+---+-------+----------+
| id|   name|department|
+---+-------+----------+
|  1|  Alice|        HR|
|  2|    Bob|        IT|
|  3|Charlie|   Finance|
+---+-------+----------+

This removes duplicate rows, mimicking SQL’s UNION DISTINCT.

Key Considerations When Using UNION in Databricks

1. Data Schema Must Match

  • union() resolves columns by position, so all DataFrames must have the same number of columns, in the same order, with compatible data types; keeping column names identical avoids confusion.
  • If schemas differ, align them before applying UNION (an alternative using unionByName() is sketched after this example):
from pyspark.sql.functions import lit
# Add a placeholder "salary" column (null) to both DataFrames so the schemas
# match; in practice you would only add it to the DataFrame that is missing it
df1 = df1.withColumn("salary", lit(None).cast("int"))
df2 = df2.withColumn("salary", lit(None).cast("int"))
aligned_union_df = df1.union(df2)
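
If the DataFrames have the same columns in a different order, or one side is missing a column entirely, unionByName() can be a safer alternative since it matches columns by name; note that the allowMissingColumns flag used in this sketch requires Spark 3.1 or later:

# Match columns by name rather than position; allowMissingColumns=True
# fills columns that exist on only one side with nulls (Spark 3.1+)
aligned_by_name_df = df1.unionByName(df2, allowMissingColumns=True)
aligned_by_name_df.show()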

2. Performance Considerations

  • Prefer a plain union() and skip deduplication unless it is genuinely needed, since distinct() triggers an expensive shuffle (you can verify this with explain(), as sketched after this list).
  • If deduplication is required, consider dropDuplicates() instead of distinct() for finer control over which columns define a duplicate:
union_df.dropDuplicates(["id"]).show()
  • Write the combined result in an efficient columnar format such as Parquet to improve downstream read performance.
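
To see the cost of deduplication for yourself, comparing the physical plans is a quick check: the deduplicated version should include an extra Exchange (shuffle) stage. A minimal sketch, reusing the df1 and df2 DataFrames from the earlier example:

# Plain union: rows are appended partition by partition, no shuffle required
df1.union(df2).explain()

# union + distinct: the plan should show an Exchange (shuffle) for deduplication
df1.union(df2).distinct().explain()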

When to Use UNION in Databricks

In Databricks, use UNION when you need to combine the results of multiple datasets that have the same number of columns and compatible data types, as in the sketch below.
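
For instance, when more than two datasets need to be combined, a common pattern is to fold union() over a list of DataFrames. A minimal sketch, assuming all DataFrames share the same schema (df3 is a hypothetical third DataFrame added here for illustration):

from functools import reduce
from pyspark.sql import DataFrame

# Hypothetical third DataFrame with the same schema as df1 and df2
df3 = spark.createDataFrame([(4, "Dana", "Marketing")], schema)

# Fold union() over the list so any number of DataFrames can be stacked
all_employees_df = reduce(DataFrame.union, [df1, df2, df3])
all_employees_df.show()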

Conclusion

Databricks provides powerful ways to combine datasets using UNION. By understanding the differences between union() and union().distinct(), you can optimize performance while maintaining data integrity.

Avoid deduplication unless it is actually required, especially on large datasets, and always ensure schema alignment for smooth UNION operations.

Have questions or a unique UNION use case? Drop them in the comments below!