Auto Optimize Delta Table in Databricks

Requirement

In our previous post, we covered how to vacuum a Delta table. In this article, we'll talk about the Auto Optimize feature for Delta tables in Databricks.

Solution

Before moving on to the solution, let's first clarify what Auto Optimize is and how it works.

What is Auto Optimize

It is a feature that automatically compacts small files during write operations to a Delta table. In essence, it merges small files into larger ones so that fewer files need to be read while querying.

It is most beneficial for write patterns that tend to produce many small files, such as MERGE operations into a Delta table, CREATE TABLE AS SELECT statements, and so on.
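
For instance, a frequent upsert workload like the sketch below tends to produce many small files that Auto Optimize then compacts (the target and updates table names here are hypothetical):

%sql

-- Upsert incremental changes into a Delta table; frequent small
-- merges like this are where Auto Optimize pays off
MERGE INTO target AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;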

Internals of Auto Optimize

It performs two operations, one after another:

  1. Optimized Writes: While writing, it reduces the number of small files by merging data into larger files of around 128 MB per table partition.
  2. Auto Compaction: After the write completes, it checks whether files can be compacted further and runs compaction on the partitions that have the most small files (you can verify the effect as shown below).
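
To observe the effect, you can check how many data files back the table before and after a write; DESCRIBE DETAIL reports this in its numFiles column (shown here on the employee table used later in this post):

%sql

-- The numFiles column shows the current number of data files in the table
DESCRIBE DETAIL employee;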

Now that we understand how Auto Optimize works, let's take a look at how we can enable it.

Approaches for Enabling Auto Optimize in Databricks

We can enable these features in various ways for different scenarios.

Enable for a New Delta Table

We can enable it through table properties while creating the table.

%sql

CREATE TABLE employee (
  empno bigint,
  ename string,
  designation string,
  manager string,
  hire_date date,
  sal double,
  deptno bigint,
  location string
)
TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
);

Here, the delta.autoOptimize.optimizeWrite property enables Optimized Writes, and the delta.autoOptimize.autoCompact property enables Auto Compaction.
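
To confirm that the properties have been applied, you can list them with the SHOW TBLPROPERTIES command:

%sql

-- Lists all table properties, including the two set above
SHOW TBLPROPERTIES employee;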

Enable for an Existing Table

We can enable Auto Optimize on an already created table using the ALTER TABLE command:

%sql

ALTER TABLE employee SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
);
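
Conversely, if the feature needs to be turned off again for this table, the same properties can be removed with UNSET (or set back to false):

%sql

ALTER TABLE employee UNSET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite',
  'delta.autoOptimize.autoCompact'
);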

Enable Auto Optimize at the Spark Session Level

If you want to enable it at the Spark session level, that can also be done by setting the corresponding default properties:

%sql

SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;

SET spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;

Note: The settings above supply defaults for tables created while they are in effect. There are also the session configurations spark.databricks.delta.optimizeWrite.enabled and spark.databricks.delta.autoCompact.enabled, which take precedence over the table properties for all writes in the session.
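
You can check what a configuration is currently set to by issuing SET with just the key:

%sql

-- Returns the current value of the configuration
SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite;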

We can also perform compaction manually using the OPTIMIZE command:

%sql
OPTIMIZE employee;

If the Auto Compaction property is enabled, there is usually no need to run OPTIMIZE separately.

Note: The OPTIMIZE command compacts files to a target size of around 1 GB, whereas Auto Compaction generates smaller files of around 128 MB.
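
Beyond plain compaction, OPTIMIZE can also co-locate related rows with Z-Ordering, which helps queries that filter on the chosen columns. A minimal example on the employee table, picking deptno purely for illustration:

%sql

-- Compact files and cluster the data by deptno
OPTIMIZE employee ZORDER BY (deptno);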

When not to use Auto Optimize

So far we have discussed the uses of Auto Optimize. However, in some scenarios enabling it may not be effective, and leaving it disabled may be preferable:

  • Datasets that are very large, say in the terabyte range, where the shuffle introduced by Optimized Writes can become expensive.
  • Auto Compaction can create transaction conflicts when multiple jobs merge (update, delete, insert) data into the same Delta table simultaneously (a workaround is shown below).
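
In the concurrent-merge scenario, a common workaround is to disable Auto Compaction for the conflicting job at the session level using the spark.databricks.delta.autoCompact.enabled configuration:

%sql

-- Turn off Auto Compaction for this session only, to avoid
-- compaction transactions conflicting with concurrent MERGE jobs
SET spark.databricks.delta.autoCompact.enabled = false;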

Wrapping Up

This post discussed the Auto Optimize feature in Databricks. It is very helpful for optimization because it requires little manual effort. There may be a small extra write cost, since a shuffle is performed as part of optimizing writes to the table, but the improvement when reading data from the tables usually makes it worthwhile.
