This is a PySpark SQL query. It provides a match percentage for a specific column when comparing two tables, Table1 and Table2.
PySpark SQL for Match Percentage
WITH distinct_table1 AS (
SELECT DISTINCT sku_code
FROM table1
WHERE s_code IS NOT NULL
),
distinct_table2 AS (
SELECT DISTINCT sku_code
FROM table2
WHERE s_code IS NOT NULL
),
match_counts AS (
SELECT
COUNT(*) AS matching_count
FROM
distinct_table2 t2
INNER JOIN
distinct_table1 t1
ON
t2.s_code = t1.s_code
),
total_table1 AS (
SELECT
COUNT(*) AS total_count
FROM
distinct_table1
),
total_table2 AS (
SELECT
COUNT(*) AS total_count
FROM
distinct_table2
)
SELECT
mc.matching_count,
tt1.total_count AS total_table1_count,
tt2.total_count AS total_table2_count,
CASE
WHEN tt2.total_count = 0 THEN 0 -- Handle division by zero
ELSE ROUND((mc.matching_count * 100.0) / tt2.total_count, 2)
END AS match_percentage
FROM
match_counts mc
CROSS JOIN
total_table1 tt1
CROSS JOIN
total_table2 tt2;
Explanation
distinct_table1CTE:- Extracts unique
s_codevalues from Table 1, removing duplicates and filtering outNULLvalues.
- Extracts unique
distinct_table2CTE:- Extracts unique
s_codevalues from Table 2, removing duplicates and filtering outNULLvalues.
- Extracts unique
match_countsCTE:- Joins the
distinct_table1anddistinct_table2ons_codeto count only unique matches.
- Joins the
total_table1andtotal_table2CTEs:- Calculate the total number of unique
s_codevalues in Table 1 and Table 2, respectively.
- Calculate the total number of unique
- Main Query:
- Combines the counts and calculates the match percentage based on unique
s_codevalues in Table 2.
- Combines the counts and calculates the match percentage based on unique
Example Data
Table 1 (Master):
| s_code |
|---|
| A123 |
| B456 |
| C789 |
| D012 |
| A123 |
Table 2:
| s_code |
|---|
| A123 |
| C789 |
| X999 |
| A123 |
Output:
| matching_count | total_table1_count | total_table2_count | match_percentage |
|---|---|---|---|
| 2 | 4 | 3 | 66.67 |
References






