Requirement
In this post, we are going to understand what is hive_default_partition in hive and why it gets created.
Components Involved
- HDFS
- HIVE
Sample Data
We will use below sample data for the task.
This is the sample data of employee details. Some employees are the member of company’s sports team like Cricket, Football, Basketball, Tennis, Badminton. Some employees are not a member of any sports team that records are highlighted in the above picture.
You can download the sample data from here.
sample_dataSolution
There are 2 types of partitions in hive – Static and Dynamic. We do assign a value to a partition column in static partition table whereas, in the dynamic partition, the value gets assigned to the partitioned column dynamically based on the data available in the table for the defined partition column. Read this post for more details about partitioning in hive.
The HIVE_DEFAULT_PARTITION in hive is represented by a NULL value of the partitioned column. That means, if we have a NULL value for a partition column and loading this record to a partitioned table, then hive_default_partition will get create for that record. Lets’ understand this with our sample data.
Step 1: Set Up
In this step, we are creating a hive table for loading the sample data. If you have already loaded the data into a table then skip this step and jump to the next step.
Let’s say we are having a hive table named emp_sports which stores employees details (sample data). We can create the table using below DDL and load the data into it:
CREATE TABLE db_bdpbase.emp_sports( id INT, firstname STRING, lastname STRING, sports STRING, city STRING, country STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE TBLPROPERTIES("skip.header.line.count"="1");
LOAD DATA INPATH '/usr/bdp/hive/sample_data.csv' INTO db_bdpbase.emp_sports;
Here, we have created a table and loaded the sample data.
Let’s check the data in this table:
select * from db_bdpbase.emp_sports limit 15;
Step 2: Create Partitioned Table
In this step, we are going to create a hive partition table which is partitioned by Sports column.
The DDL will look like:
CREATE TABLE DB_BDPBASE.DEFAULT_PARTITION_TEST( ID INT, FIRSTNAME STRING, LASTNAME STRING, CITY STRING, COUNTRY STRING ) PARTITIONED BY ( SPORTS STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Step 3: Load data into Partitioned Table
In this step, we are inserting data from the base table into the created hive partitioned table.
INSERT OVERWRITE TABLE DB_BDPBASE.DEFAULT_PARTITION_TEST PARTITION (SPORTS) SELECT ID, FIRSTNAME, LASTNAME, CITY, COUNTRY, SPORTS from DB_BDPBASE.EMP_SPORTS;
Here, total 6 partitions have created which are Badminton, Basketball, Cricket, Football, Tennis, and HIVE_DEFAULT_PARTITION. The HIVE_DEFAULT_PARTITION partition has been created because of NULL value of the partitioned column. The records which are having a NULL value for sports are getting stored in HIVE_DEFAULT_PARTITION partition.
SHOW PARTITIONS db_bdpbase.DEFAULT_PARTITION_TEST;
Let’s check records of this default partition:
select * from db_bdpbase.DEFAULT_PARTITION_TEST where sports="__HIVE_DEFAULT_PARTITION__";
Wrapping Up
In this post, we understood what is the default partition in hive. This gets create when we load data into partitioned table dynamically and partition column has a NULL value.