In Databricks, you often need to track data changes over time. One example is comparing a table’s content between different versions. This blog post will guide you in comparing all table columns from version 0 to version 1 using Databricks SQL.
What is Version Control in Databricks?
Version control in Databricks allows tracking changes to data tables over time, making it possible to recover previous versions. This is vital for audit trails, compliance, and rollbacks. Databricks uses Delta Lake, which offers built-in versioning features.
The Setup
Let’s assume we have a Delta table named customer_data. In our example, version 0 contains the initial dataset of customers, and version 1 includes some updates and new entries.
Step 1: Creating a Sample Delta Table
For demonstration purposes, create a Delta table with some initial data:
CREATE TABLE customer_data
(
customer_id INT,
name STRING,
email STRING,
updated_at TIMESTAMP
)
USING DELTA;
INSERT INTO customer_data VALUES
(1, 'Alice Smith', 'alice@example.com', '2023-01-01 10:00:00'),
(2, 'Bob Johnson', 'bob@example.com', '2023-01-01 10:05:00');
Step 2: Version 0 of the Table
At this point, the table exists in version 0. We can confirm its version:
DESCRIBE HISTORY customer_data;
Step 3: Updating the Table to Create Version 1
Now, let’s update the table by adding a new customer and updating an existing one:
INSERT INTO customer_data VALUES
(1, 'Alice Smith', 'alice_new@example.com', '2023-01-02 14:00:00'),
(3, 'Charlie Brown', 'charlie@example.com', '2023-01-02 14:05:00');
Make sure changes are as version 1. This verification is essential for maintaining accuracy and preventing potential issues. It is crucial to verify this to avoid any complications.
DESCRIBE HISTORY customer_data;
Step 4: Comparing Version 0 and Version 1
To compare the two versions of the table, we can utilize the VERSION AS OF clause in Databricks SQL. We will select all columns from both versions and identify differences.
Query for Version 0:
SELECT customer_id, name, email, updated_at
FROM customer_data VERSION AS OF 0;
Query for Version 1:
SELECT customer_id, name, email, updated_at
FROM customer_data VERSION AS OF 1;
Step 5: Full Comparison Using a JOIN
To effectively find the differences, we can perform a full outer join on these two queries:
SELECT
COALESCE(v0.customer_id, v1.customer_id) AS customer_id,
v0.name AS name_v0,
v1.name AS name_v1,
v0.email AS email_v0,
v1.email AS email_v1,
v0.updated_at AS updated_at_v0,
v1.updated_at AS updated_at_v1
FROM
(SELECT customer_id, name, email, updated_at FROM customer_data VERSION AS OF 0) v0
FULL OUTER JOIN
(SELECT customer_id, name, email, updated_at FROM customer_data VERSION AS OF 1) v1
ON
v0.customer_id = v1.customer_id
WHERE
v0.name != v1.name OR
v0.email != v1.email OR
v0.updated_at != v1.updated_at OR
v0.customer_id IS NULL OR
v1.customer_id IS NULL;
Conclusion
By following the above steps, you can easily compare all table columns across different versions in Databricks SQL. This process helps to track changes effectively and maintain data integrity over time. Leveraging the Delta Lake capabilities allows for vivid data management and versioning in your data workflows.






