Partitioning in Hive

Requirement

Suppose we have source data that needs to be stored in a Hive partitioned table. Our requirement is to load this data into Hive tables using both static and dynamic partitioning and, with an understanding of how partitioning works in Hive, see where each approach fits.

Solution

First, create a temp table to stage the data. This table will hold all the records, and from it we will load the static and dynamic partitioned Hive tables based on the partition column(s).

Create table:

  CREATE TABLE partitioned_temp(
    empId INT,
    firstname STRING,
    lastname STRING,
    city STRING,
    mobile STRING,
    yearofexperience INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;

Load data into the created table:

  LOAD DATA LOCAL INPATH '/user/data/employee.txt' INTO TABLE partitioned_temp;
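
As a quick sanity check (a sketch; the rows returned depend on your input file), sample a few records from the staging table:

  SELECT * FROM partitioned_temp LIMIT 5;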

Static partition: Implement on Managed table

Below is the sample data that we will store in the Hive static partitioned table:

  empId,firstName,lastName,city,mobile,yearofexperience
  101,A,Z,Kolkata,9000000001,3
  102,B,Y,Bangalore,9000000002,3
  103,C,X,Mumbai,9000000003,3
  104,D,W,Pune,9000000004,5

PARTITION TABLE CREATION FORMAT:

  CREATE TABLE <table_name>(
    <non-partition columns with data types>
  ) PARTITIONED BY (<partition column with data type>)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;

Case I: Partition column is a table column

Create the partitioned table:

  CREATE TABLE partitioned_test_managed(
    empId INT,
    firstname STRING,
    lastname STRING,
    city STRING,
    mobile STRING)
  PARTITIONED BY (yearofexperience INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;

Here, I have listed all the non-partition columns in the create block and put the partition column, which is a column of the source data, in the PARTITIONED BY clause.
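
If you want to confirm how Hive registered the columns, DESCRIBE FORMATTED lists the partition column in a separate section from the regular data columns:

  DESCRIBE FORMATTED partitioned_test_managed;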

Load data from FILE

Command to Load the data into the static partitioned table:

  LOAD DATA LOCAL INPATH '/user/data/employee.txt' INTO TABLE partitioned_test_managed PARTITION (yearofexperience=3);

Here I have loaded the data by specifying the partition value 3, i.e. all of this data goes into the yearofexperience=3 directory. The full path of the directory is /user/hive/warehouse/partitioned_test_managed/yearofexperience=3.

Likewise, you can load different data under different partition values; Hive creates a new directory for each new partition value, as shown below.
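
For example, loading a second file (hypothetically named employee_exp5.txt and containing only records with five years of experience) under a different partition value creates a new directory, which you can confirm with SHOW PARTITIONS:

  LOAD DATA LOCAL INPATH '/user/data/employee_exp5.txt'
  INTO TABLE partitioned_test_managed PARTITION (yearofexperience=5);

  -- lists the registered partitions, e.g. yearofexperience=3 and yearofexperience=5
  SHOW PARTITIONS partitioned_test_managed;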

Load Data from another hive table

Command to Load the data into the table:

  INSERT OVERWRITE TABLE partitioned_test_managed
  PARTITION (yearofexperience=3)
  SELECT empId, firstname, lastname, city, mobile FROM partitioned_temp temp
  WHERE temp.yearofexperience=3;

In the above query, notice that we select only the non-partition columns from the temp table, because partitioned_test_managed has just those five regular columns; the rows are written into the yearofexperience=3 partition.

Case II: Partition column is not a table column

Our source data has six columns (empId, firstname, lastname, city, mobile, yearofexperience), but here we want to partition on a column that is not one of them.

Let's take a scenario where data arrives on a daily basis and we want to store it in date-wise directories. For this, we add a partition column named dateValue (with data type STRING).

Create table query (drop the Case I table first if it still exists, since the same name is reused):

  CREATE TABLE partitioned_test_managed(
    empId INT,
    firstname STRING,
    lastname STRING,
    city STRING,
    mobile STRING,
    yearofexperience INT
  ) PARTITIONED BY (dateValue STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;

Load data from FILE

Command to Load the data into the table:

  LOAD DATA LOCAL INPATH '/user/data/employee.txt' INTO TABLE partitioned_test_managed PARTITION (dateValue='20160101');

Here, the data is stored in the dateValue=20160101 directory. If you supply the next date value when loading, Hive creates a new date directory and stores the data there, as shown below.
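
For example (assuming the next day's file, hypothetically /user/data/employee_20160102.txt):

  -- creates and populates the dateValue=20160102 directory
  LOAD DATA LOCAL INPATH '/user/data/employee_20160102.txt'
  INTO TABLE partitioned_test_managed PARTITION (dateValue='20160102');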

Load data from another hive table

Command to Load the data into the table:

  INSERT OVERWRITE TABLE partitioned_test_managed
  PARTITION (dateValue='20160101')
  SELECT * FROM partitioned_temp;

Static partition: Implement on External table

PARTITION TABLE CREATION FORMAT:

  CREATE EXTERNAL TABLE <table_name>(
    <non-partition columns with data types>
  ) PARTITIONED BY (<partition column with data type>)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION '<hdfs_location>';

Case I: Partition column is a table column

Create external table:

  CREATE EXTERNAL TABLE partitioned_test_external(
    empId INT,
    firstname STRING,
    lastname STRING,
    city STRING,
    mobile STRING)
  PARTITIONED BY (yearofexperience INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION '/user/bdp/hiveData/yearofexperience=3';

Load data from FILE

We already have partition data at '/user/bdp/hiveData/yearofexperience=3', so we created an external Hive table and pointed its location at that partition data path.

When loading data from an HDFS file, the partition must be added to the table so that the metastore gets updated. To add the partition, alter the table as shown below:

  ALTER TABLE partitioned_test_external ADD PARTITION (yearofexperience=3) LOCATION '/user/bdp/hiveData/yearofexperience=3';
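
As an aside, if the table's LOCATION pointed at a parent directory containing many yearofexperience=<value> sub-directories, MSCK REPAIR TABLE could discover and register all of them in one pass instead of adding each partition by hand:

  MSCK REPAIR TABLE partitioned_test_external;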

Load data from another hive table

To load data from another Hive table, execute the following query. Here, we are creating the partition yearofexperience=5.

  INSERT OVERWRITE TABLE partitioned_test_external
  PARTITION (yearofexperience=5)
  SELECT empId, firstname, lastname, city, mobile FROM partitioned_temp temp
  WHERE temp.yearofexperience=5;

Dynamic partition

With static partitioning, we supplied a fixed value for the partition column while creating each partition. With dynamic partitioning, the partitions are created dynamically, i.e. Hive creates a partition for each distinct value of the partition column.
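
Depending on your Hive configuration, dynamic partitioning may need to be enabled first; in particular, when no static partition value is given (as in the insert below), non-strict mode is required:

  -- allow dynamic partition inserts
  SET hive.exec.dynamic.partition=true;
  -- allow all partition columns to be dynamic (no static value required)
  SET hive.exec.dynamic.partition.mode=nonstrict;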

Create a table, either managed or external; here I am creating an external table:

  DROP TABLE partitioned_test_external;

  CREATE EXTERNAL TABLE partitioned_test_external(
    empId INT,
    firstname STRING,
    lastname STRING,
    city STRING,
    mobile STRING)
  PARTITIONED BY (yearofexperience INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION '/user/bdp/hiveData';

I have deleted all the existing partition directories from the mentioned path. Now load the data from the non-partitioned temp table.

  INSERT OVERWRITE TABLE partitioned_test_external
  PARTITION (yearofexperience)
  SELECT empId, firstname, lastname, city, mobile, yearofexperience FROM partitioned_temp;

Note: In the above query, the partition column (yearofexperience) is selected last from the temp table. Always place the partition column last in a dynamic-partition insert; otherwise Hive will partition on whatever column happens to come last in the SELECT list.

Finally, verify the data in the table.
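
A couple of queries to check the result (a sketch; the exact partitions and rows depend on the data you loaded):

  -- one partition per distinct yearofexperience value in the source data
  SHOW PARTITIONS partitioned_test_external;

  -- sample a few rows to confirm the load
  SELECT * FROM partitioned_test_external LIMIT 10;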

Wrapping Up

Partitioning in Hive plays an important role when storing bulk data. With a partitioned Hive table, you can query a specific slice of the data, because each slice sits in its own partition. That wraps up our discussion of static and dynamic partitioning in Hive, along with the demo.
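
For example, a query that filters on the partition column reads only the matching partition directory (partition pruning) instead of scanning the whole table:

  -- only the yearofexperience=3 directory is read
  SELECT firstname, lastname, city
  FROM partitioned_test_external
  WHERE yearofexperience=3;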
