Requirement
In our previous post, we covered how to vacuum a Delta table. In this article, we'll talk about the Auto Optimize feature for Delta tables in Databricks.
Solution
Before moving on to the solution, let's first clarify what Auto Optimize is and how it functions; then we'll look at how to use it.
What is Auto Optimize
Auto Optimize is a feature that helps handle the small-files problem by compacting files during write operations to Delta tables. In essence, it merges small files so that fewer files need to be read while querying.
It is most beneficial for operations such as MERGE writes into a Delta table, creating a new table from a SELECT query (CTAS), and similar workloads.
Internals of Auto Optimize
It performs two operations, one after another:
- Optimize files while writing: reduces the number of small files by merging them into bigger files of around 128 MB for each table partition as the data is written.
- Auto Compaction: after the write operation, it checks whether files can be compacted further for the partitions with the most small files.
Now that we understand how Auto Optimize works, let's take a look at how to enable it.
Approaches for Enabling Auto Optimize in Databricks
We can enable these features in various ways for different scenarios.
Enable for a New Delta Table
We can enable it with table properties while creating the table:
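A minimal sketch in SQL; the student table and its columns are hypothetical placeholders:

```sql
-- Create a new Delta table with Auto Optimize enabled via table properties
CREATE TABLE student (
  id   INT,
  name STRING
)
USING DELTA
TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,  -- merge small files during the write
  delta.autoOptimize.autoCompact = true     -- compact remaining small files afterwards
);
```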
Here, the delta.autoOptimize.optimizeWrite property enables Optimized Writes, and the delta.autoOptimize.autoCompact property enables Auto Compaction.
Enable for an Existing Table
We can enable Auto Optimize on an already created table using the ALTER TABLE command:
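Assuming the same hypothetical student table as above:

```sql
-- Turn on Auto Optimize for a table that already exists
ALTER TABLE student SET TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
);
```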
Enable Auto Optimize in the Spark Session
If you want to enable it at the Spark session level, that can also be done by setting the corresponding Spark configuration properties:
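For example, in a notebook or SQL session (these are the Databricks session-level Delta configs, which apply to all Delta writes in the session):

```sql
-- Enable Optimized Writes and Auto Compaction for the current session
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;
```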
Note: Spark session properties override the table properties.
We can also perform compaction manually using the OPTIMIZE command:
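Again using the hypothetical student table:

```sql
-- Manually compact the small files of the table
OPTIMIZE student;

-- Optionally co-locate data by a column while compacting (id is a placeholder)
OPTIMIZE student ZORDER BY (id);
```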
Here, if the Auto Compaction property is enabled, there is no need to run OPTIMIZE separately.
Note: the OPTIMIZE command compacts files toward a target size of 1 GB, whereas Auto Compaction generates smaller files of around 128 MB.
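If you need different targets, these file sizes can be tuned through session configs; a sketch, assuming these configs are available in your Databricks Runtime (values are in bytes):

```sql
-- Target file size for the OPTIMIZE command (~1 GB, the default)
SET spark.databricks.delta.optimize.maxFileSize = 1073741824;

-- Target file size for Auto Compaction (~128 MB, the default)
SET spark.databricks.delta.autoCompact.maxFileSize = 134217728;
```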
When not to use Auto Optimize
So far, we have discussed the benefits of Auto Optimize. In some scenarios, however, enabling it may not be effective, and leaving it disabled may be preferable:
- Datasets that are very large, say in the terabytes.
- Auto Compaction can create transaction conflicts when multiple jobs merge (update, delete, insert) data into the same Delta table simultaneously.
Wrapping Up
This post discussed the Auto Optimize feature in Databricks. It is very helpful because it keeps tables optimized with little manual effort. There may be a small extra cost at write time, since shuffles are performed as part of optimizing the writes, but the faster reads from the table usually make it worthwhile.