Calculate Match Percentage in PySpark SQL

This is a PySpark SQL query. It provides a match percentage for a specific column when comparing two tables, Table1 and Table2.

PySpark SQL for Match Percentage

WITH distinct_table1 AS (
    SELECT DISTINCT sku_code
    FROM table1
    WHERE s_code IS NOT NULL
),
distinct_table2 AS (
    SELECT DISTINCT sku_code
    FROM table2
    WHERE s_code IS NOT NULL
),
match_counts AS (
    SELECT 
        COUNT(*) AS matching_count
    FROM 
        distinct_table2 t2
    INNER JOIN 
        distinct_table1 t1
    ON 
        t2.s_code = t1.s_code
),
total_table1 AS (
    SELECT 
        COUNT(*) AS total_count
    FROM 
        distinct_table1
),
total_table2 AS (
    SELECT 
        COUNT(*) AS total_count
    FROM 
        distinct_table2
)
SELECT 
    mc.matching_count,
    tt1.total_count AS total_table1_count,
    tt2.total_count AS total_table2_count,
    CASE 
        WHEN tt2.total_count = 0 THEN 0  -- Handle division by zero
        ELSE ROUND((mc.matching_count * 100.0) / tt2.total_count, 2)
    END AS match_percentage
FROM 
    match_counts mc
CROSS JOIN 
    total_table1 tt1
CROSS JOIN 
    total_table2 tt2;

Databricks Top Course

Explanation

distinct_table1 CTE:
- Extracts unique s_code values from Table 1, removing duplicates and filtering out NULL values.
distinct_table2 CTE:
- Extracts unique s_code values from Table 2, removing duplicates and filtering out NULL values.
match_counts CTE:
- Joins the distinct_table1 and distinct_table2 on s_code to count only unique matches.
total_table1 and total_table2 CTEs:
- Calculate the total number of unique s_code values in Table 1 and Table 2, respectively.
Main Query:
- Combines the counts and calculates the match percentage based on unique s_code values in Table 2.

Example Data

Table 1 (Master):

s_code
A123
B456
C789
D012
A123

Table 2:

s_code
A123
C789
X999
A123

Output:

matching_count	total_table1_count	total_table2_count	match_percentage
2	4	3	66.67

References

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.