Requirement

In this post, we will go through the concept of Bucketing in Hive. This post will cover the below-following points about Bucketing:

What is Bucketing in Hive?
How to load data into a Bucketed table?
What is its importance?

Components Involved

HIVE
HDFS

Sample Data

We will use the given sample data for the task.

EMPID	FIRSTNAME	LASTNAME	SPORTS	CITY	COUNTRY
1001	Emerry	Blair	Basketball	Qutubullapur	San Marino
1002	Zephr	Stephenson	Cricket	Neerharen	Dominican Republic
1003	Autumn	Bean	Basketball	Neerharen	Dominican Republic
1004	Kasimir	Vance	Badminton	Neerharen	Dominican Republic
1005	Mufutau	Flores		Qutubullapur	San Marino
1006	Ayanna	Banks	Football	Neerharen	Dominican Republic
1007	Selma	Ball	Tennis	Qutubullapur	San Marino
1008	Berk	Fuller	Badminton	Neerharen	Dominican Republic
1009	Imogene	Terrell		Qutubullapur	San Marino
1010	Colorado	Hutchinson	Tennis	Qutubullapur	San Marino

This is the sample data of employee details. Some employees are a member of the company’s sports team like Cricket, Football, Basketball, Tennis, Badminton. Some employees are not a member of any sports team that records are highlighted in the above picture.

You can download the sample data from below.

sample_data

Solution

Let us first understand that what is Bucketing in Hive?

Bucketing is an optimization technique which is used to cluster the datasets into more manageable parts, which helps to optimize the query performance.

Now, let’s start with the second part how to load data into Bucketed Table?

We cannot load the data into a bucketed table using LOAD DATA INPATH command as Hive does not support it. So, we have to create a base table which actually has the data and then we will create a bucketed table and dump the data from the base table to bucketed table.

Step 1: Create Base Table

In this step, we are going to create a base table named employee_base

 CREATE TABLE db_bdpbase.employee_base (
    emplid     INT,
    firstname  STRING,
    lastname   STRING,
     sports STRING,
    city       STRING,
    country    STRING
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

Our sample file contains a header, so we have to remove it during loading the data into the table. Here, the added properties “skip header” in the table DDL will do that.

Step 2: Load data into Base Table

We are having sample data at ‘/usr/bdp/hive/sample_data.csv’ location in HDFS. Let’s load it into the created table using below command:

 LOAD DATA INPATH '/user/bdp/hive/sample_data.csv' INTO TABLE  db_bdpbase.employee_base;

Let’s check the data in this table:

Step 3: Create Bucketed Table

In this step, we will create a bucketed table. We can either create a bucketed table with a partition or without partition. We will create both the tables.

Case I: Bucketed Table With Partition

In this case, a partition will be bucketed. Below is the table structure:

 CREATE TABLE db_bdpbase.bucketed_partition_tbl (
      empid     INT,
      firstname  STRING,
      lastname   STRING,
      sports STRING,
      city       STRING
) PARTITIONED BY(country STRING)
CLUSTERED BY (empid)
SORTED BY (empid ASC) INTO 4 BUCKETS;

Here, the country is the partition column and table bucketed on the empid column which is sorted in ascending order.

Case II: Bucketed Table Without Partition

In this table, we are creating a bucketed table without any partition.

 CREATE TABLE db_bdpbase.bucketed_tbl_only (
      empid     INT,
      firstname  STRING,
      lastname   STRING,
      city       STRING,
      Country STRING
) 
CLUSTERED BY (empid)
SORTED BY (empid ASC) INTO 4 BUCKETS;

The table has been bucketed on the same column empid.

Step 4: Set Property

By default, the bucket is disabled in Hive. We have to enable it by setting value true to the below property in the hive:

 set hive.enforce.bucketing=true; (Not needed in Hive 2.x onward)

This property will select the number of reducers and the cluster by column automatically based on the table.

Step 5: Load Data Into Bucketed Table

We have created two bucketed table and 1 base table which actually have the sample data. Now let’s load data into the bucketed table from the base table using the below query:

Insert into the bucketed table with partition:

 INSERT OVERWRITE TABLE db_bdpbase.bucketed_partition_tbl PARTITION (country) SELECT * FROM db_bdpbase.employee_base;

Now, insert into the bucketed table which is without any partition:

 INSERT OVERWRITE TABLE db_bdpbase.bucketed_tbl_only SELECT * FROM db_bdpbase.employee_base;

Step 6: Bucketed Table Data

We have loaded data into the bucketed table. Let’s check how the data have been stored at the HDFS location:

First checking on the bucketed table with partition.

 hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/db_bdpbase.db/bucketed_partition_tbl

Here, the HDFS directory having two subdirectories which actually represents partitions of the table.

Check each partition directory,

Here, if you see in the screenshot each partition(highlighted by an orange rectangle) have 4 files(highlighted by a red rectangle) as we had defined 4 buckets in the created table.

Now let’s see for the another bucketed table which is without any partition.

This table having only 4 files for each bucket without any partition.

Understand Data Storage in Bucket

Each data get mapped to the specific bucket based on the below formula:

hash_function(bucket_column) mode num_bucket

Here, if we take the first table which is partitioned by country, then our sample data will get split like below:

EMPID	FIRSTNAME	LASTNAME	SPORTS	CITY	COUNTRY
1002	Zephr	Stephenson	Cricket	Neerharen	Dominican Republic
1003	Autumn	Bean	Basketball	Neerharen	Dominican Republic
1004	Kasimir	Vance	Badminton	Neerharen	Dominican Republic
1006	Ayanna	Banks	Football	Neerharen	Dominican Republic
1008	Berk	Fuller	Badminton	Neerharen	Dominican Republic

and,

EMPID	FIRSTNAME	LASTNAME	SPORTS	CITY	COUNTRY
1001	Emerry	Blair	Basketball	Qutubullapur	San Marino
1005	Mufutau	Flores		Qutubullapur	San Marino
1007	Selma	Ball	Tennis	Qutubullapur	San Marino
1009	Imogene	Terrell		Qutubullapur	San Marino
1010	Colorado	Hutchinson	Tennis	Qutubullapur	San Marino

For country Dominican Republic,

Each row will get stored in each bucket as below:

hash_function(1002) mode 4 = 2 (Representing index of bucket)
hash_function(1003) mode 4 = 3
hash_function(1004) mode 4 = 0
hash_function(1006) mode 4 = 2
hash_function(1008) mode 4 = 0

Here, hash_function of INT value will give the same result. Let’s check the data in each file at the HDFS location:

Likewise, we can also check for the country, San Marino.

Wrapping Up

In this post, we have seen what is bucketing in hive and how we can load data into the bucketed table. In addition to this, we have seen how to create a bucketed table with partition and without partitions. You can optimize the data storage using the bucketed table. The map-side join will be faster on the bucketed table and also you can store data in sorted order on one or multi-columns in each bucket.

Bucketing in Hive