How to Compare Hashed Columns Before and After a Change in Databricks

I created a technique for comparing hashed columns in a table to verify accuracy before and after modifications. I developed a SQL script to automate the task and reduce the time required for manual comparisons.

MD5 vs SHA-256: Top Differences

Feature	MD5	SHA-256
Hash Type	Cryptographic hash function	Cryptographic hash function
Output Length	128 bits (32-character hex)	256 bits (64-character hex)
Speed	Very fast	Slower than MD5
Security	Weak — vulnerable to collisions & attacks	Strong — no practical collision attacks yet
Collision Resistance	Poor — collisions found	Strong
Pre-image Resistance	Weak	Strong
Designed For	Integrity checks, digital signatures (historically)	Secure cryptographic applications
Year Released	1991	2001 (part of SHA-2 family)
Encryption/Decryption	❌ No — Not reversible (not encryption)	❌ No — Not reversible (not encryption)
Salt Support	❌ No built-in support	❌ No built-in support (but used with salt often)
Password Hashing Use	❌ Not recommended	✅ Recommended (with salt & stretching)
Common Uses	File integrity checks, duplicate detection	Digital certificates, blockchain, secure hashing
Broken?	✅ Yes — considered broken for security	❌ No

Can MD5 or SHA-256 be Used for Encryption or Decryption?

No. MD5 and SHA-256 are hash functions, not encryption algorithms.

🔐 Hashing is one-way: you input data and get a fixed-size hash output. You cannot get the original data back from the hash.
🔐 Encryption is two-way: you can encrypt and later decrypt data using a key.

🔁 Think of hashing like blending a smoothie: you can’t un-blend it to get the exact original ingredients.

Databricks Hash Comparison

Consider a table named product_table with 1 million rows.
The hash_old is MD5 hashed and uses the ID and brand for hashing.
The id column has been updated to a new format, generating a new MD5 hash (hash_new).
Verify if the new hash matches the original values.

Create a Table and Insert Rows

-- Enable Delta format in Databricks
CREATE TABLE product_table (
    id STRING,
    brand STRING,
    hash_old STRING,
    hash_new STRING
) USING DELTA;
INSERT INTO product_table VALUES
    ('12345', 'Nike', MD5(CONCAT('12345', 'Nike')), NULL),
    ('67890', 'Adidas', MD5(CONCAT('67890', 'Adidas')), NULL),
    ('54321', 'Puma', MD5(CONCAT('54321', 'Puma')), NULL);

Table data

Update the Table with New Hashes

Now, assume id has been changed to a new format (e.g., prefixed with NEW_).

UPDATE product_table
SET id = CONCAT('NEW_', id), 
    hash_new = MD5(CONCAT(CONCAT('NEW_', id), brand));
SELECT * FROM product_table;

Rows after the update

Compare with Delta Time Travel (Version 1 vs. Current Version)

Compare the old and new data using the time travel table. Version 1 is the old table because version 0 is for creating the table, and version 1 is after the data has been inserted.

View the original data (Version 1):

select * from product_table version as of 1;

id	brand	hash_old	                     hash_new
12345	Nike	b543eb60934e3e5429aa97d97fcf4fe1	null
67890	Adidas	a90b98d5701948d1df23d96f6cea32e5	null
54321	Puma	fcaad176b3cc1370d25df95e84d641fa	null

Compare old and new hashes side by side:

SELECT 
    old.id AS old_id, 
    old.brand, 
    old.hash_old,
    new.id AS new_id, 
    new.hash_new,
    CASE 
        WHEN old.hash_old = new.hash_new THEN 'MATCH' 
        ELSE 'MISMATCH' 
    END AS hash_status
FROM product_table VERSION AS OF 1 old
JOIN product_table new 
ON old.brand = new.brand;

The MISMATCH in the output says that the hash is converted successfully and does not match the old value.

The comparison of old and new hash value

Conclusion

Do a row count to check if data is lost.
Check the hash formula consistency.
The MISMATCHES are fine since the new hash was created based on the new ID.

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.