How to Compare All Columns of a Table as of Version 0 and as of Version 1 in Databricks SQL

In Databricks, you often need to track data changes over time. One example is comparing a table’s content between different versions. This blog post will guide you in comparing all table columns from version 0 to version 1 using Databricks SQL.

What is Version Control in Databricks?

Version control in Databricks allows tracking changes to data tables over time, making it possible to recover previous versions. This is vital for audit trails, compliance, and rollbacks. Databricks uses Delta Lake, which offers built-in versioning features.

The Setup

Let’s assume we have a Delta table named customer_data. In our example, version 0 contains the initial dataset of customers, and version 1 includes some updates and new entries.

Step 1: Creating a Sample Delta Table

For demonstration purposes, create a Delta table with some initial data:

CREATE TABLE customer_data
(
    customer_id INT,
    name STRING,
    email STRING,
    updated_at TIMESTAMP
)
USING DELTA;

INSERT INTO customer_data VALUES 
(1, 'Alice Smith', 'alice@example.com', '2023-01-01 10:00:00'),
(2, 'Bob Johnson', 'bob@example.com', '2023-01-01 10:05:00');

Step 2: Version 0 of the Table

At this point, the table exists in version 0. We can confirm its version:

DESCRIBE HISTORY customer_data;

Step 3: Updating the Table to Create Version 1

Now, let’s update the table by adding a new customer and updating an existing one:

INSERT INTO customer_data VALUES 
(1, 'Alice Smith', 'alice_new@example.com', '2023-01-02 14:00:00'),
(3, 'Charlie Brown', 'charlie@example.com', '2023-01-02 14:05:00');

Make sure changes are as version 1. This verification is essential for maintaining accuracy and preventing potential issues. It is crucial to verify this to avoid any complications.

DESCRIBE HISTORY customer_data;

Step 4: Comparing Version 0 and Version 1

To compare the two versions of the table, we can utilize the VERSION AS OF clause in Databricks SQL. We will select all columns from both versions and identify differences.

Query for Version 0:

SELECT customer_id, name, email, updated_at 
FROM customer_data VERSION AS OF 0;

Query for Version 1:

SELECT customer_id, name, email, updated_at 
FROM customer_data VERSION AS OF 1;

Step 5: Full Comparison Using a JOIN

To effectively find the differences, we can perform a full outer join on these two queries:

SELECT 
    COALESCE(v0.customer_id, v1.customer_id) AS customer_id,
    v0.name AS name_v0,
    v1.name AS name_v1,
    v0.email AS email_v0,
    v1.email AS email_v1,
    v0.updated_at AS updated_at_v0,
    v1.updated_at AS updated_at_v1
FROM 
    (SELECT customer_id, name, email, updated_at FROM customer_data VERSION AS OF 0) v0
FULL OUTER JOIN 
    (SELECT customer_id, name, email, updated_at FROM customer_data VERSION AS OF 1) v1
ON 
    v0.customer_id = v1.customer_id
WHERE 
    v0.name != v1.name OR
    v0.email != v1.email OR
    v0.updated_at != v1.updated_at OR
    v0.customer_id IS NULL OR
    v1.customer_id IS NULL;

Conclusion

By following the above steps, you can easily compare all table columns across different versions in Databricks SQL. This process helps to track changes effectively and maintain data integrity over time. Leveraging the Delta Lake capabilities allows for vivid data management and versioning in your data workflows.

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.