Here is the code to execute semi and anti joins in SQL and PySpark. These join operations filter rows based on the presence or absence of matching rows in another table. Let’s understand each with real-world examples in MySQL and PySpark.

Semi Vs. Anti Joins

Table of contents

  1. SQL: Semi Vs. Anti Joins
    1. Semi-Join (Left Distinct Match)
    2. Anti-Join (Left Non-Match)
  2. PySpark: Semi Vs. Anti Joins
    1. Semi-Join in PySpark
    2. Anti-Join in PySpark

SQL: Semi Vs. Anti Joins

Semi-Join (Left Distinct Match)

  • A semi-join returns rows from the first table (LEFT) for which at least one matching row exists in the second table (RIGHT); each left-table row appears at most once (distinct). It is essentially an existence check.
  • In MySQL, you can achieve this with an EXISTS or IN subquery, or by joining the tables and selecting distinct rows from the left table (both alternatives are sketched after the example below).

Example:

Let’s say we have two tables: orders and customers. We want to find all customers who have placed at least one order.

SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
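
The same semi-join can also be written with an IN subquery, or with a join followed by DISTINCT on the left table’s columns, as mentioned above. A minimal sketch of both variants, using the same tables:

-- Semi-join using an IN subquery
SELECT *
FROM customers c
WHERE c.customer_id IN (
    SELECT o.customer_id
    FROM orders o
);

-- Semi-join using a join plus DISTINCT on the left table's columns
SELECT DISTINCT c.*
FROM customers c
JOIN orders o
  ON o.customer_id = c.customer_id;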

Anti-Join (Left Non-Match)

An anti-join, on the other hand, returns rows from the first table (LEFT) that have no matching rows in the second table (RIGHT). It is the opposite of a semi-join.

Example:

Continuing with the same example, let’s find all customers who have not placed any orders.

SELECT *
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
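
For comparison, the same anti-join is often expressed with a LEFT JOIN that keeps only the unmatched rows; this is an additional common pattern beyond the NOT EXISTS form used here:

-- Anti-join using LEFT JOIN and an IS NULL check on the right table's key
SELECT c.*
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;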

In these examples:

  • The orders table contains order information, including the customer_id.
  • The customers table holds information about customers.
  • We use a correlated subquery within the EXISTS clause to check for the existence of orders placed by each customer.
  • For the anti-join, we negate the condition with NOT EXISTS to find customers with no matching orders.

These queries show how to perform semi-joins and anti-joins in MySQL, filtering rows based on the existence or absence of matching rows in another table.

PySpark: Semi Vs. Anti Joins

The same concepts apply in PySpark, where you can achieve similar functionality with DataFrame operations. Let’s illustrate each with a real-world example.

Semi-Join in PySpark

  • A semi-join in PySpark returns rows from the first DataFrame (left) for which at least one matching row exists in the second DataFrame (right).
  • You can achieve this in several ways: the built-in left_semi join type (used below), an inner join() followed by distinct(), an isin() filter, or an EXISTS subquery in Spark SQL (the join() and isin() variants are sketched after the example below).

Example:

Let’s say we have two DataFrames: orders_df and customers_df. We want to find all customers who have placed at least one order.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("SemiJoinExample") \
    .getOrCreate()

# Sample data for orders and customers
data_customers = [("C001", "Alice"), ("C002", "Bob"), ("C003", "Charlie")]
data_orders = [("O001", "C001"), ("O002", "C002"), ("O003", "C004")]

# Create DataFrames
customers_df = spark.createDataFrame(data_customers, ["customer_id", "customer_name"])
orders_df = spark.createDataFrame(data_orders, ["order_id", "customer_id"])

# Performing a semi-join: Selecting only customers who have placed orders
semi_join_df = customers_df.join(orders_df, 
                                 on="customer_id", 
                                 how="left_semi")

# Show the result of the semi-join
semi_join_df.show()
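
For reference, here is a minimal sketch of the join()-plus-distinct() and isin() alternatives mentioned above; it reuses the customers_df and orders_df DataFrames created in the example, and the variable names are illustrative.

# Alternative 1: inner join on the distinct customer_ids that appear in orders
semi_join_alt_df = customers_df.join(
    orders_df.select("customer_id").distinct(),
    on="customer_id",
    how="inner")
semi_join_alt_df.show()

# Alternative 2: isin() filter
# (collects the distinct customer_ids to the driver; fine for small tables)
order_customer_ids = [row["customer_id"] for row in
                      orders_df.select("customer_id").distinct().collect()]
semi_join_alt2_df = customers_df.filter(
    customers_df["customer_id"].isin(order_customer_ids))
semi_join_alt2_df.show()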

Anti-Join in PySpark

An anti-join in PySpark returns rows from the first DataFrame (left) that have no matching row in the second DataFrame (right). You can achieve this in several ways: join() with the left_anti join type (used below), subtract(), or an isin() filter combined with negation (the latter two are sketched after the example below).

Example:

Continuing with the same example, let’s find all customers who have not placed any orders.

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("AntiJoinExample") \
    .getOrCreate()

# Sample data for customers and orders
data_customers = [("C001", "Alice"), ("C002", "Bob"), ("C003", "Charlie"), ("C004", "David")]
data_orders = [("O001", "C001"), ("O002", "C002"), ("O003", "C004")]

# Create DataFrames
customers_df = spark.createDataFrame(data_customers, ["customer_id", "customer_name"])
orders_df = spark.createDataFrame(data_orders, ["order_id", "customer_id"])

# Anti-join: Selecting only customers who have not placed any orders
anti_join_df = customers_df.join(orders_df, 
                                 on="customer_id", 
                                 how="left_anti")

# Show the result of the anti-join
anti_join_df.show()
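
For reference, here is a minimal sketch of the negated isin() and subtract() alternatives mentioned above; it reuses customers_df and orders_df from the example, and the variable names are illustrative.

# Alternative 1: negated isin() filter
# (collects the distinct customer_ids to the driver; fine for small tables)
order_customer_ids = [row["customer_id"] for row in
                      orders_df.select("customer_id").distinct().collect()]
anti_join_alt_df = customers_df.filter(
    ~customers_df["customer_id"].isin(order_customer_ids))
anti_join_alt_df.show()

# Alternative 2: subtract() the customers who do have orders
customers_with_orders_df = customers_df.join(
    orders_df.select("customer_id").distinct(),
    on="customer_id",
    how="inner")
anti_join_alt2_df = customers_df.subtract(customers_with_orders_df)
anti_join_alt2_df.show()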

In these examples:

  • orders_df and customers_df are the DataFrames representing orders and customers, respectively.
  • We use PySpark DataFrame operations, chiefly join() with the left_semi and left_anti join types (plus distinct(), filter(), isin(), and subtract() in the alternative sketches), to express semi-join and anti-join operations.
  • The resulting DataFrames contain the desired rows based on the semi-join and anti-join criteria.

These examples show how to use semi-joins and anti-joins in PySpark to filter rows based on the existence or absence of matching rows in another DataFrame.
