Requirement
Assume you have a large amount of data. You frequently perform ad-hoc analyses on this dataset. However, retrieving the results of your query takes time. In order to achieve a fast result, we must perform some optimizations. Bloom Filter Index is one of the optimizations available. In this post, we are going to discuss Bloom Filter Index.
Solution
In order to understand how Bloom Filter Index works, let’s first understand what it is.
This is a space-efficient data structure that allows data to be skipped over selected columns.
Background of Bloom Filter Index
When we create a bloom filter index on a table column, a bloom filter index file is created for each data file. It is this index file that is read before the actual data file is read. If no index exists or the bloom filter is not defined, it will read the data file.
The bloom filter index file is a parquet file with a single row. There is a folder _delta_index where it gets created. The file naming format is the same as the data file with the suffix “.index.v1.parquet“. For example, if the data file name is dbfs:/mount/datafile.0000.parquet.snappy then the index file name will be dbfs:/mount/datafile.0000.parquet.snappy.index.v1.parquet.
It works on FPP aka false positive probability which means either the data is probably in the file or not in the file. The lower the value, the higher the accuracy rate. The size of a bloom filter depends on how many elements the filter has been created for. For each element, 5 bits are required.
Create a Bloom Filter Index
Before creating the bloom filter index, one thing needs to understand – it doesn’t create an index for the data already available in the table. It will create new or rewritten data after enabling the bloom filter index on the table for a column.
Let’s say we have a new table with the below schema:
CREATE TABLE testdb.bloomtabletest ( empno bigint, ename string, designation string, manager string, hire_date date, sal double, depno bigint, location string ) Using DELTA Location '/mnt/bdp/bloomtabletest';
Create bloom filter index on the above table:
CREATE BLOOMFILTER INDEX
ON TABLE bloomtabletest
FOR COLUMNS(ename OPTIONS (fpp=0.1, numItems=1000000);
Here, we have configured the bloom filter index on a column level of a table.
fpp: False Positive Probability value. Default is 0.1
numItems: The distinct value of column a file can contain. The default value is 1 million items.
Once you create this bloom filter, it result will be faster while running any query to find the item for a created column.
Wrapping Up
Throughout this post, we discussed bloom filter indexes and how to create them on a table. The result can be observed by implementing it on high-volume data and checking the results.