Requirement
In this post, we will go through the concept of Bucketing in Hive. This post will cover the below-following points about Bucketing:
- What is Bucketing in Hive?
- How to load data into a Bucketed table?
- What is its importance?
Components Involved
- HIVE
- HDFS
Sample Data
We will use the given sample data for the task.
EMPID | FIRSTNAME | LASTNAME | SPORTS | CITY | COUNTRY |
1001 | Emerry | Blair | Basketball | Qutubullapur | San Marino |
1002 | Zephr | Stephenson | Cricket | Neerharen | Dominican Republic |
1003 | Autumn | Bean | Basketball | Neerharen | Dominican Republic |
1004 | Kasimir | Vance | Badminton | Neerharen | Dominican Republic |
1005 | Mufutau | Flores |
| Qutubullapur | San Marino |
1006 | Ayanna | Banks | Football | Neerharen | Dominican Republic |
1007 | Selma | Ball | Tennis | Qutubullapur | San Marino |
1008 | Berk | Fuller | Badminton | Neerharen | Dominican Republic |
1009 | Imogene | Terrell |
| Qutubullapur | San Marino |
1010 | Colorado | Hutchinson | Tennis | Qutubullapur | San Marino |
This is the sample data of employee details. Some employees are a member of the company’s sports team like Cricket, Football, Basketball, Tennis, Badminton. Some employees are not a member of any sports team that records are highlighted in the above picture.
You can download the sample data from below.
Solution
Let us first understand that what is Bucketing in Hive?
Bucketing is an optimization technique which is used to cluster the datasets into more manageable parts, which helps to optimize the query performance.
Now, let’s start with the second part how to load data into Bucketed Table?
We cannot load the data into a bucketed table using LOAD DATA INPATH command as Hive does not support it. So, we have to create a base table which actually has the data and then we will create a bucketed table and dump the data from the base table to bucketed table.
Step 1: Create Base Table
In this step, we are going to create a base table named employee_base
CREATE TABLE db_bdpbase.employee_base ( emplid INT, firstname STRING, lastname STRING, sports STRING, city STRING, country STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1");
Our sample file contains a header, so we have to remove it during loading the data into the table. Here, the added properties “skip header” in the table DDL will do that.
Step 2: Load data into Base Table
We are having sample data at ‘/usr/bdp/hive/sample_data.csv’ location in HDFS. Let’s load it into the created table using below command:
LOAD DATA INPATH '/user/bdp/hive/sample_data.csv' INTO TABLE db_bdpbase.employee_base;
Let’s check the data in this table:
Step 3: Create Bucketed Table
In this step, we will create a bucketed table. We can either create a bucketed table with a partition or without partition. We will create both the tables.
Case I: Bucketed Table With Partition
In this case, a partition will be bucketed. Below is the table structure:
CREATE TABLE db_bdpbase.bucketed_partition_tbl ( empid INT, firstname STRING, lastname STRING, sports STRING, city STRING ) PARTITIONED BY(country STRING) CLUSTERED BY (empid) SORTED BY (empid ASC) INTO 4 BUCKETS;
Here, the country is the partition column and table bucketed on the empid column which is sorted in ascending order.
Case II: Bucketed Table Without Partition
In this table, we are creating a bucketed table without any partition.
CREATE TABLE db_bdpbase.bucketed_tbl_only ( empid INT, firstname STRING, lastname STRING, city STRING, Country STRING ) CLUSTERED BY (empid) SORTED BY (empid ASC) INTO 4 BUCKETS;
The table has been bucketed on the same column empid.
Step 4: Set Property
By default, the bucket is disabled in Hive. We have to enable it by setting value true to the below property in the hive:
set hive.enforce.bucketing=true; (Not needed in Hive 2.x onward)
This property will select the number of reducers and the cluster by column automatically based on the table.
Step 5: Load Data Into Bucketed Table
We have created two bucketed table and 1 base table which actually have the sample data. Now let’s load data into the bucketed table from the base table using the below query:
Insert into the bucketed table with partition:
INSERT OVERWRITE TABLE db_bdpbase.bucketed_partition_tbl PARTITION (country) SELECT * FROM db_bdpbase.employee_base;
Now, insert into the bucketed table which is without any partition:
INSERT OVERWRITE TABLE db_bdpbase.bucketed_tbl_only SELECT * FROM db_bdpbase.employee_base;
Step 6: Bucketed Table Data
We have loaded data into the bucketed table. Let’s check how the data have been stored at the HDFS location:
First checking on the bucketed table with partition.
hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/db_bdpbase.db/bucketed_partition_tbl
Here, the HDFS directory having two subdirectories which actually represents partitions of the table.
Check each partition directory,
Here, if you see in the screenshot each partition(highlighted by an orange rectangle) have 4 files(highlighted by a red rectangle) as we had defined 4 buckets in the created table.
Now let’s see for the another bucketed table which is without any partition.
This table having only 4 files for each bucket without any partition.
Understand Data Storage in Bucket
Each data get mapped to the specific bucket based on the below formula:
hash_function(bucket_column) mode num_bucket
Here, if we take the first table which is partitioned by country, then our sample data will get split like below:
EMPID | FIRSTNAME | LASTNAME | SPORTS | CITY | COUNTRY |
1002 | Zephr | Stephenson | Cricket | Neerharen | Dominican Republic |
1003 | Autumn | Bean | Basketball | Neerharen | Dominican Republic |
1004 | Kasimir | Vance | Badminton | Neerharen | Dominican Republic |
1006 | Ayanna | Banks | Football | Neerharen | Dominican Republic |
1008 | Berk | Fuller | Badminton | Neerharen | Dominican Republic |
and,
EMPID | FIRSTNAME | LASTNAME | SPORTS | CITY | COUNTRY |
1001 | Emerry | Blair | Basketball | Qutubullapur | San Marino |
1005 | Mufutau | Flores |
| Qutubullapur | San Marino |
1007 | Selma | Ball | Tennis | Qutubullapur | San Marino |
1009 | Imogene | Terrell |
| Qutubullapur | San Marino |
1010 | Colorado | Hutchinson | Tennis | Qutubullapur | San Marino |
For country Dominican Republic,
Each row will get stored in each bucket as below:
hash_function(1002) mode 4 = 2 (Representing index of bucket) hash_function(1003) mode 4 = 3 hash_function(1004) mode 4 = 0 hash_function(1006) mode 4 = 2 hash_function(1008) mode 4 = 0
Here, hash_function of INT value will give the same result. Let’s check the data in each file at the HDFS location:
Likewise, we can also check for the country, San Marino.
Wrapping Up
In this post, we have seen what is bucketing in hive and how we can load data into the bucketed table. In addition to this, we have seen how to create a bucketed table with partition and without partitions. You can optimize the data storage using the bucketed table. The map-side join will be faster on the bucketed table and also you can store data in sorted order on one or multi-columns in each bucket.