I created a technique for comparing hashed columns in a table to verify accuracy before and after modifications. I developed a SQL script to automate the task and reduce the time required for manual comparisons.
MD5 vs SHA-256: Top Differences
| Feature | MD5 | SHA-256 |
|---|---|---|
| Hash Type | Cryptographic hash function | Cryptographic hash function |
| Output Length | 128 bits (32-character hex) | 256 bits (64-character hex) |
| Speed | Very fast | Slower than MD5 |
| Security | Weak — vulnerable to collisions & attacks | Strong — no practical collision attacks yet |
| Collision Resistance | Poor — collisions found | Strong |
| Pre-image Resistance | Weak | Strong |
| Designed For | Integrity checks, digital signatures (historically) | Secure cryptographic applications |
| Year Released | 1991 | 2001 (part of SHA-2 family) |
| Encryption/Decryption | ❌ No — Not reversible (not encryption) | ❌ No — Not reversible (not encryption) |
| Salt Support | ❌ No built-in support | ❌ No built-in support (but used with salt often) |
| Password Hashing Use | ❌ Not recommended | ✅ Recommended (with salt & stretching) |
| Common Uses | File integrity checks, duplicate detection | Digital certificates, blockchain, secure hashing |
| Broken? | ✅ Yes — considered broken for security | ❌ No |
Can MD5 or SHA-256 be Used for Encryption or Decryption?
No. MD5 and SHA-256 are hash functions, not encryption algorithms.
- 🔐 Hashing is one-way: you input data and get a fixed-size hash output. You cannot get the original data back from the hash.
- 🔐 Encryption is two-way: you can encrypt and later decrypt data using a key.
🔁 Think of hashing like blending a smoothie: you can’t un-blend it to get the exact original ingredients.
Databricks Hash Comparison
- Consider a table named product_table with 1 million rows.
- The hash_old is MD5 hashed and uses the ID and brand for hashing.
- The id column has been updated to a new format, generating a new MD5 hash (hash_new).
- Verify if the new hash matches the original values.
Create a Table and Insert Rows
-- Enable Delta format in Databricks
CREATE TABLE product_table (
id STRING,
brand STRING,
hash_old STRING,
hash_new STRING
) USING DELTA;
INSERT INTO product_table VALUES
('12345', 'Nike', MD5(CONCAT('12345', 'Nike')), NULL),
('67890', 'Adidas', MD5(CONCAT('67890', 'Adidas')), NULL),
('54321', 'Puma', MD5(CONCAT('54321', 'Puma')), NULL);
Table data

Update the Table with New Hashes
Now, assume id has been changed to a new format (e.g., prefixed with NEW_).
UPDATE product_table
SET id = CONCAT('NEW_', id),
hash_new = MD5(CONCAT(CONCAT('NEW_', id), brand));
SELECT * FROM product_table;
Rows after the update

Compare with Delta Time Travel (Version 1 vs. Current Version)
Compare the old and new data using the time travel table. Version 1 is the old table because version 0 is for creating the table, and version 1 is after the data has been inserted.
View the original data (Version 1):
select * from product_table version as of 1;
id brand hash_old hash_new
12345 Nike b543eb60934e3e5429aa97d97fcf4fe1 null
67890 Adidas a90b98d5701948d1df23d96f6cea32e5 null
54321 Puma fcaad176b3cc1370d25df95e84d641fa null
Compare old and new hashes side by side:
SELECT
old.id AS old_id,
old.brand,
old.hash_old,
new.id AS new_id,
new.hash_new,
CASE
WHEN old.hash_old = new.hash_new THEN 'MATCH'
ELSE 'MISMATCH'
END AS hash_status
FROM product_table VERSION AS OF 1 old
JOIN product_table new
ON old.brand = new.brand;
The MISMATCH in the output says that the hash is converted successfully and does not match the old value.

Conclusion
- Do a row count to check if data is lost.
- Check the hash formula consistency.
- The MISMATCHES are fine since the new hash was created based on the new ID.







You must be logged in to post a comment.