I created a technique for comparing hashed columns in a table to verify accuracy before and after modifications. I developed a SQL script to automate the task and reduce the time required for manual comparisons.

  1. MD5 vs SHA-256: Top Differences
    1. Can MD5 or SHA-256 be Used for Encryption or Decryption?
  2. Databricks Hash Comparison
    1. Create a Table and Insert Rows
    2. Update the Table with New Hashes
    3. Compare with Delta Time Travel (Version 1 vs. Current Version)
  3. Conclusion

MD5 vs SHA-256: Top Differences

FeatureMD5SHA-256
Hash TypeCryptographic hash functionCryptographic hash function
Output Length128 bits (32-character hex)256 bits (64-character hex)
SpeedVery fastSlower than MD5
SecurityWeak — vulnerable to collisions & attacksStrong — no practical collision attacks yet
Collision ResistancePoor — collisions foundStrong
Pre-image ResistanceWeakStrong
Designed ForIntegrity checks, digital signatures (historically)Secure cryptographic applications
Year Released19912001 (part of SHA-2 family)
Encryption/DecryptionNo — Not reversible (not encryption)No — Not reversible (not encryption)
Salt Support❌ No built-in support❌ No built-in support (but used with salt often)
Password Hashing Use❌ Not recommended✅ Recommended (with salt & stretching)
Common UsesFile integrity checks, duplicate detectionDigital certificates, blockchain, secure hashing
Broken?✅ Yes — considered broken for security❌ No

Can MD5 or SHA-256 be Used for Encryption or Decryption?

No. MD5 and SHA-256 are hash functions, not encryption algorithms.

  • 🔐 Hashing is one-way: you input data and get a fixed-size hash output. You cannot get the original data back from the hash.
  • 🔐 Encryption is two-way: you can encrypt and later decrypt data using a key.

Databricks Hash Comparison

  • Consider a table named product_table with 1 million rows.
  • The hash_old is MD5 hashed and uses the ID and brand for hashing.
  • The id column has been updated to a new format, generating a new MD5 hash (hash_new).
  • Verify if the new hash matches the original values.

Create a Table and Insert Rows

-- Enable Delta format in Databricks
CREATE TABLE product_table (
    id STRING,
    brand STRING,
    hash_old STRING,
    hash_new STRING
) USING DELTA;
INSERT INTO product_table VALUES
    ('12345', 'Nike', MD5(CONCAT('12345', 'Nike')), NULL),
    ('67890', 'Adidas', MD5(CONCAT('67890', 'Adidas')), NULL),
    ('54321', 'Puma', MD5(CONCAT('54321', 'Puma')), NULL);

Table data

Table output

Update the Table with New Hashes

Now, assume id has been changed to a new format (e.g., prefixed with NEW_).

UPDATE product_table
SET id = CONCAT('NEW_', id), 
    hash_new = MD5(CONCAT(CONCAT('NEW_', id), brand));
SELECT * FROM product_table;

Rows after the update

Table data after update query

Compare with Delta Time Travel (Version 1 vs. Current Version)

Compare the old and new data using the time travel table. Version 1 is the old table because version 0 is for creating the table, and version 1 is after the data has been inserted.

View the original data (Version 1):

select * from product_table version as of 1;
id	brand	hash_old	                     hash_new
12345	Nike	b543eb60934e3e5429aa97d97fcf4fe1	null
67890	Adidas	a90b98d5701948d1df23d96f6cea32e5	null
54321	Puma	fcaad176b3cc1370d25df95e84d641fa	null

Compare old and new hashes side by side:

SELECT 
    old.id AS old_id, 
    old.brand, 
    old.hash_old,
    new.id AS new_id, 
    new.hash_new,
    CASE 
        WHEN old.hash_old = new.hash_new THEN 'MATCH' 
        ELSE 'MISMATCH' 
    END AS hash_status
FROM product_table VERSION AS OF 1 old
JOIN product_table new 
ON old.brand = new.brand;

The MISMATCH in the output says that the hash is converted successfully and does not match the old value.

The comparison of old and new hash value

Conclusion

  • Do a row count to check if data is lost.
  • Check the hash formula consistency.
  • The MISMATCHES are fine since the new hash was created based on the new ID.