Bucketing in Hive

Bucketing in Hive

Requirement

In this post, we will go through the concept of Bucketing in Hive. This post will cover the below-following points about Bucketing:

  • What is Bucket in Hive?
  • How to load data into a Bucketed table?
  • What is its importance?

Components Involved

  • HIVE
  • HDFS

Sample Data

We will use given sample data for the task.

EMPID

FIRSTNAME

LASTNAME

SPORTS

CITY

COUNTRY

1001

Emerry

Blair

Basketball

Qutubullapur

San Marino

1002

Zephr

Stephenson

Cricket

Neerharen

Dominican Republic

1003

Autumn

Bean

Basketball

Neerharen

Dominican Republic

1004

Kasimir

Vance

Badminton

Neerharen

Dominican Republic

1005

Mufutau

Flores

Qutubullapur

San Marino

1006

Ayanna

Banks

Football

Neerharen

Dominican Republic

1007

Selma

Ball

Tennis

Qutubullapur

San Marino

1008

Berk

Fuller

Badminton

Neerharen

Dominican Republic

1009

Imogene

Terrell

Qutubullapur

San Marino

1010

Colorado

Hutchinson

Tennis

Qutubullapur

San Marino

This is the sample data of employee details. Some employees are the member of the company’s sports team like Cricket, Football, Basketball, Tennis, Badminton. Some employees are not a member of any sports team that records are highlighted in the above picture.

You can download the sample data from below.

sample_data

Solution

Let us first understand that what is Bucketing in Hive?

Bucket is an optimization technique which is used to cluster the datasets into more manageable parts, which helps to optimize the query performance.

Now, let’s start with the second part how to load data into Bucketed Table?

We cannot load the data into a bucketed table using LOAD DATA INPATH command as Hive does not support it. So, we have to create a base table which actually has the data and then we will create a bucketed table and dump the data from the base table to bucketed table.

Step 1: Create Base Table

In this step, we are going to create a base table named employee_base

 
 
CREATE TABLE db_bdpbase.employee_base (
    emplid     INT,
    firstname  STRING,
    lastname   STRING,
     sports STRING,
    city       STRING,
    country    STRING
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

Our sample file contains a header, so we have to remove it during loading the data into the table. Here, the added properties “skip header” in the table DDL will do that.

Step 2: Load data into Base Table

We are having sample data at  ‘/usr/bdp/hive/sample_data.csv’ location in HDFS. Let’s load it into the created table using below command:

 
 
LOAD DATA INPATH '/user/bdp/hive/sample_data.csv' INTO TABLE  db_bdpbase.employee_base;

Let’s check the data in this table:

Step 3: Create Bucketed Table

In this step, we will create a bucketed table. We can either create a bucketed table with a partition or without partition. We will create both the tables.

Case I: Bucketed Table With Partition

In this case, a partition will be bucketed. Below is the table structure:

 
 
CREATE TABLE db_bdpbase.bucketed_partition_tbl (
      empid     INT,
      firstname  STRING,
      lastname   STRING,
      sports STRING,
      city       STRING
) PARTITIONED BY(country STRING)
CLUSTERED BY (empid)
SORTED BY (empid ASC) INTO 4 BUCKETS;

Here, the country is the partition column and table bucketed on the empid column which is sorted in ascending order.

Case II: Bucketed Table Without Partition

In this table, we are creating a bucketed table without any partition.

 
 
CREATE TABLE db_bdpbase.bucketed_tbl_only (
      empid     INT,
      firstname  STRING,
      lastname   STRING,
      city       STRING,
      Country STRING
) 
CLUSTERED BY (empid)
SORTED BY (empid ASC) INTO 4 BUCKETS;

The table has been bucketed on the same column empid.

Step 4: Set Property

By default, the bucket is disabled in Hive. We have to enable it by setting value true to the below property in the hive:

 
 
set hive.enforce.bucketing=true; (Not needed in Hive 2.x onward)

This property will select the number of reducers and the cluster by column automatically based on the table.

Step 5: Load Data Into Bucketed Table

We have created two bucketed table and 1 base table which actually have the sample data. Now let’s load data into the bucketed table from base table using the below query:

Insert into the bucketed table with partition:

 
 
INSERT OVERWRITE TABLE db_bdpbase.bucketed_partition_tbl PARTITION (country) SELECT * FROM db_bdpbase.employee_base;

Now, insert into the bucketed table which is without any partition:

 
 
INSERT OVERWRITE TABLE db_bdpbase.bucketed_tbl_only SELECT * FROM db_bdpbase.employee_base;

Step 6: Bucketed Table Data

We have loaded data into the bucketed table. Let’s check how the data have been stored at HDFS location:

First checking on the bucketed table with partition.

 
 
hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/db_bdpbase.db/bucketed_partition_tbl

Here, the HDFS directory having two subdirectory which actually represents partitions of the table.

Check each partition directory,

Here, if you see in the screenshot each partition(highlighted by an orange rectangle) have 4 files(highlighted by a red rectangle) as we had defined 4 buckets in created table.

Now let’s see for the another bucketed table which is without any partition.

This table having only 4 files for each bucket without any partition.

Understand Data Storage in Bucket

Each data get mapped to the specific bucket based on the below formula:

hash_function(bucket_column) mode num_bucket

Here, if we take the first table which is partitioned by country, then our sample data will get split like below:

EMPID

FIRSTNAME

LASTNAME

SPORTS

CITY

COUNTRY

1002

Zephr

Stephenson

Cricket

Neerharen

Dominican Republic

1003

Autumn

Bean

Basketball

Neerharen

Dominican Republic

1004

Kasimir

Vance

Badminton

Neerharen

Dominican Republic

1006

Ayanna

Banks

Football

Neerharen

Dominican Republic

1008

Berk

Fuller

Badminton

Neerharen

Dominican Republic

and,

EMPID

FIRSTNAME

LASTNAME

SPORTS

CITY

COUNTRY

1001

Emerry

Blair

Basketball

Qutubullapur

San Marino

1005

Mufutau

Flores

Qutubullapur

San Marino

1007

Selma

Ball

Tennis

Qutubullapur

San Marino

1009

Imogene

Terrell

Qutubullapur

San Marino

1010

Colorado

Hutchinson

Tennis

Qutubullapur

San Marino

For country Dominican Republic,

Each row will get stored in each bucket as below:

 
 
  1. hash_function(1002) mode 4 = 2 (Representing index of bucket)
  2. hash_function(1003) mode 4 = 3
  3. hash_function(1004) mode 4 = 0
  4. hash_function(1006) mode 4 = 2
  5. hash_function(1008) mode 4 = 0

Here, hash_function of INT value will give the same result. Let’s check the data in each file at HDFS location:

Likewise, we can also check for the country, San Marino.

Wrapping Up

In this post, we have seen what is bucketed table and how we can load data into the bucketed table. In addition to this, we have seen how to create a bucketed table with partition and without partitions. You can optimize the data storage using the bucketed table. Map-side join will be faster on the bucketed table and also you can store data in sorted order on one or multi-columns in each bucket.

2
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.