Partitioning in Hive with example - BIG DATA PROGRAMMERS

Requirement

Suppose there is a source data, which is required to store in the hive partitioned table. So our requirement is to store the data in the hive table with static and dynamic partitions. With an understanding of partitioning in the hive, we will see where to use the static and dynamic partitions.

Solutions

First, create a temp table to store the data. This table will have all the data and from this table, we will load data into static and dynamic partitioned hive table based on the partitioned column(s).

 CREATE TABLE partitioned_temp(
empId INT,
firstname STRING,
lastname STRING,
city STRING,
mobile STRING,    
yearofexperience INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED as TEXTFILE;

Load data into the created table:

 LOAD DATA LOCAL INPATH ‘/user/data/employee.txt’ INTO TABLE partitioned_temp;

Static partition: Implement on Managed table

Suppose below are the sample data which we will store in the static partitioned hive table:

 empId,firstName,lastNam,city,mobile,yearofexperience
 101,A,Z,Kolkata,9000000001,3
 102,B,Y,Bangalore,9000000002,3
 103,C,X,Mumbai,9000000003,3
 104,D,W,Pune,9000000004,5

PARTITION TABLE CREATION FORMAT:

 CREATE TABLE <table_name>(
 Mention_all_non_partition_column_name_along with data type
 ) PARTITIONED BY (mention_partition_column_name_along_with_data_type)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 STORED as TEXTFILE;

Case I: Partition column is a table column

 CREATE TABLE partitioned_test_managed(
 empId INT,
 firstname STRING,
 lastname STRING,
 city STRING,
 mobile STRING)
 PARTITIONED BY (yearofexperience INT)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 STORED as TEXTFILE;

Here, I have mentioned all the non-partitioned columns in create block and partitioned column in “partitioned by” clause which is a column of the table.

Load data from FILE

Command to Load the data into the static partitioned table:

 LOAD DATA LOCAL INPATH ‘/user/data/employee.txt’ INTO TABLE partitioned_test_managed PARTITION (yearofexperience=3);

Here, I have loaded the data by mentioning partition value 3 i.e. all the data will be in directory yearofexperience=3. The full path of the directory is /user/hive/warehouse/partitioned_test_managed/yearofexperience=3

Likewise, you can load different data based on different partition value. It will create a new directory for new partition value.

Load Data from another hive table

Command to Load the data into the table:

 INSERT OVERWRITE TABLE partitioned_test_managed
 PARTITION(yearofexperience=3)
 SELECT empId,firstname,lastname,city,mobile FROM partitioned_temp temp
 WHERE temp.yearofexperience=3;

In the above query, if you notice we are selecting only non-partitioned columns (only 5 columns) from temp table and providing yearofexperience=3. So, the partition table will have all the records of the temp table in this partition.

Case II: Partition column is not a table column

Our source data have six columns (empId, firstname, lastname, city, mobile, yearofexperience), but we want to have an extra column which will act as a partition column.

Let’s have a scenario where data is coming on a daily basis, in this case, we will store data on date wise. For this, we will add partition column named as dateValue(taking data type string).

Create table query:

 CREATE TABLE partitioned_test_managed(
 empId int,
 firstname STRING,
 lastname STRING,
 city STRING,
 mobile STRING,
 yearofexperience INT
 ) PARTITIONED BY (dateValue STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 STORED as TEXTFILE;

Load data from FILE

Command to Load the data into the table:

 LOAD DATA LOCAL INPATH ‘/user/data/employee.txt’ INTO TABLE partitioned_test_managed PARTITION (dateValue=20160101);

Here, we have stored the data in the date directory (20160101). If you provide next date value and load the data, it will create a new date directory and will store the data.

Load data from another hive table

Command to Load the data into the table:

 INSERT OVERWRITE TABLE partitioned_test_managed
 PARTITION(dateValue=20160101)
 SELECT * FROM partitioned_temp;

Static partition: Implement on External table

PARTITION TABLE CREATION FORMAT:

 CREATE TABLE <table_name>(
 Mention_all_non_partition_column_name_along with data type
 ) PARTITIONED BY (mention_partition_column_name_along_with_data_type)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 STORED as TEXTFILE
 LOCATION ‘<hdfs_location>’;

Case I: Partition column is a table column

Create an external table:

 CREATE TABLE partitioned_test_external(
 empId INT,
 firstname STRING,
 lastname STRING,
 city STRING,
 mobile STRING)
 PARTITIONED BY (yearofexperience INT)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 STORED as TEXTFILE LOCATION '/user/bdp/hiveData/yearofexperience=3';

Load data from FILE

We have partition data at location ‘/user/bdp/hiveData/yearofexperience=3’. So we have created an external hive table and mentioned the location as an external location.

For loading data from HDFS file, it is required to add a partition to the table, so that it can update in meta store. To add partition, alter the table as shown below:

 ALTER TABLE partitioned_test_external ADD PARTITION (yearofexperience=3) LOCATION '/user/hiveData/yearofexperience=3';

Load data from another hive table

For loading data from another hive table, execute the below query. Here, we are creating partition yearofexperience=5.

 INSERT OVERWRITE TABLE partitioned_test_managed
 PARTITION(yearofexperience=5)
 SELECT empId,firstname,lastname,city,mobile FROM partitioned_temp temp
 WHERE temp.yearofexperience=5;

Dynamic partition

As we have seen in the Static partition while creating partition we were giving some static value for the partition column. In dynamic partition, the partition will happen dynamically i.e. it will create a partition based on the value of the partition column.

Create a table either managed or an external table. I am creating an external table:

 CREATE TABLE partitioned_test_external(
 empId INT,
 firstname STRING,
 lastname STRING,
 city STRING,
 mobile STRING)
 PARTITIONED BY (yearofexperience INT)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 STORED as TEXTFILE
 LOCATION '/user/bdp/hiveData';

I have deleted all the existing partition from the mentioned path. Now load the data from another non-partitioned table.

 INSERT OVERWRITE TABLE partitioned_test_external
 PARTITION(yearofexperience)
 SELECT empId,firstname,lastname,city,mobile,yearofexperience FROM partitioned_temp;

Note: In the above query, I have mentioned the partition column (yearofexperience) name at last while selecting from the temp table. Always mention partition column at last in dynamic partition otherwise it will do partitioning based on the column which is mentioned at last.

Verify data in the table:

Wrapping Up

Partitioning in Hive plays an important role while storing the bulk of data. With the hive partitioned table, you can query on the specific bulk of data as it is available in the partition. Partitioning is best to improve the query performance when we are looking for a specific bulk of data (eg. monthly data from yearly data).

Partitioning in Hive

Requirement

Solutions

Static partition: Implement on Managed table

Case I: Partition column is a table column

Case II: Partition column is not a table column

Static partition: Implement on External table

Case I: Partition column is a table column

Dynamic partition

Wrapping Up

Leave a Reply Cancel reply

Tags