This is a PySpark SQL query. It provides a match percentage for a specific column when comparing two tables, Table1 and Table2.

PySpark SQL for Match Percentage

WITH distinct_table1 AS (
SELECT DISTINCT sku_code
FROM table1
WHERE s_code IS NOT NULL
),
distinct_table2 AS (
SELECT DISTINCT sku_code
FROM table2
WHERE s_code IS NOT NULL
),
match_counts AS (
SELECT
COUNT(*) AS matching_count
FROM
distinct_table2 t2
INNER JOIN
distinct_table1 t1
ON
t2.s_code = t1.s_code
),
total_table1 AS (
SELECT
COUNT(*) AS total_count
FROM
distinct_table1
),
total_table2 AS (
SELECT
COUNT(*) AS total_count
FROM
distinct_table2
)
SELECT
mc.matching_count,
tt1.total_count AS total_table1_count,
tt2.total_count AS total_table2_count,
CASE
WHEN tt2.total_count = 0 THEN 0 -- Handle division by zero
ELSE ROUND((mc.matching_count * 100.0) / tt2.total_count, 2)
END AS match_percentage
FROM
match_counts mc
CROSS JOIN
total_table1 tt1
CROSS JOIN
total_table2 tt2;

Databricks Top Course

Explanation

  1. distinct_table1 CTE:
    • Extracts unique s_code values from Table 1, removing duplicates and filtering out NULL values.
  2. distinct_table2 CTE:
    • Extracts unique s_code values from Table 2, removing duplicates and filtering out NULL values.
  3. match_counts CTE:
    • Joins the distinct_table1 and distinct_table2 on s_code to count only unique matches.
  4. total_table1 and total_table2 CTEs:
    • Calculate the total number of unique s_code values in Table 1 and Table 2, respectively.
  5. Main Query:
    • Combines the counts and calculates the match percentage based on unique s_code values in Table 2.

Example Data

Table 1 (Master):

s_code
A123
B456
C789
D012
A123

Table 2:

s_code
A123
C789
X999
A123

Output:

matching_counttotal_table1_counttotal_table2_countmatch_percentage
24366.67

References