Vacuum Delta Table

Requirement

As an example, let’s say you have created a Delta table in Databricks. The Delta table is used to keep track of versions of the data. It is a good idea to remove older versions of data from delta tables. This is where the vacuum comes into play. In line with its name, vacuums remove old versions. The purpose of this post is to explain how to vacuum a Delta table.

Solution

It would be helpful to understand the structure of the Delta table data before jumping into the solution.

As shown in the screenshot above, a file and folder named _delta_log are available which is a transaction log for a table that stores data commits. This folder indicates a delta table or delta lake. The remaining files represented as part-0000* are the actual data files with snappy compression and in parquet format.

Why Vacuum?

As we know, the delta table keeps data versions. If we don’t remove old versions, then the lake size will keep growing. It’s better to remove older version data as these will not get used in our process. We always process the latest version of data. The vacuum removes all files of the older versions.

Now, you may ask, how to know the details of the data versions available. You can get it by running the SQL command on the table.

%sql
DESCRIBE HISTORY <tablename>;

Vacuum Delta Table

In order to perform a vacuum, we need to run a SQL function called vacuum on the table:

VACUUM
<table_name>

Please note, once you run VACUUM on a delta table, you will lose the older versions of data which is irreversible. In case you want to keep an older version for some hours, you can keep using the below command:

VACUUM
<table_name> RETAIN 168 HOURS

Here, I am keeping only the last 7 days (168 hours) of older versions of data instead of all older data versions.

Wrapping Up

In this post, we have seen what a vacuum is, why to use a Vacuum and how to use a Vacuum on the Delta table. This is the best way to optimize the table. You can create a script for all the tables and schedule it once in a week.

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply