Requirement
Suppose we have data in Hive table. We want the same data into HBase table. So, our requirement is to migrate the data from Hive to HBase table.
Components Involved
- Hive – Source table
- HBase – Target Table
Solution
We cannot load data directly into HBase table from the hive. In order to achieve the requirement, we have to go through the following steps:
Step 1: Create Hive table
If you already have hive table with data then jump to step 3.
We are creating this hive table as a source. This table data, we want in HBase table.
- CREATE TABLE hive_table(
- empno INT,
- ename string,
- designation string,
- manager INT,
- sal INT,
- deptno INT)
- ROW FORMAT DELIMITED
- FIELDS TERMINATED BY '\t'
- LINES TERMINATED BY '\n'
- STORED AS TEXTFILE;
Step 2: Load data into Hive
Loading the data from the local path. In my case, the local path is /root/bdp/hbase/data/emp_data.csv.
- LOAD DATA LOCAL INPATH '/root/bdp/hbase/data/emp_data.csv' INTO TABLE hive_table;
Step 3: Create HBase-Hive Mapping table
In this step, we are creating another hive table which actually points to an HBase table.
- CREATE TABLE hbase_table_employee
- (
- empno INT,
- ename STRING,
- designation STRING,
- manager INT,
- hire_date STRING,
- sal INT,
- deptno INT
- )
- STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
- WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno")
- TBLPROPERTIES ("hbase.table.name" = "employee_hbase");
Here, we are specifying HBaseStorageHandler in Stored By option. Also, mapping the hive column with HBase column family using SERDEPROPERTIES. It will create an HBase table named employee_hbase which will point to this hive table.
Note: Key represents the first column of the hive table. So Id will become the key.
In case, you are already having an HBase table and want to load data into existing HBase table, then you have to use EXTERNAL in your above hive DDL. Otherwise, you will get an error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:MetaException(message:Table employee_hbase_2 already exists within HBase; use CREATE EXTERNAL TABLE instead to register it in Hive.)
- CREATE EXTERNAL TABLE hbase_table_employee_2
- (
- empno INT,
- ename STRING,
- designation STRING,
- manager INT,
- hire_date STRING,
- sal INT,
- deptno INT
- )
- STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
- WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno")
- TBLPROPERTIES ("hbase.table.name" = "employee_hbase_2");
Step 4: Load data into HBase from Hive
In this step, we are going to migrate hive table data to HBase. That means we will load the hive (created in step 1) data to hive table created in step 3.
- INSERT INTO TABLE hbase_table_employee SELECT * FROM hive_table;
We have loaded data into hbase_table_employee table which is pointing to HBase table employee_hbase.
Step 5: Scan HBase Table
Let’s check the data in HBase table:
- hbase(main):008:0> scan 'employee_hbase'
- ROW COLUMN+CELL
- 7369 column=cf:deptno, timestamp=1514476352028, value=20
- 7369 column=cf:designation, timestamp=1514476352028, value=CLERK
- 7369 column=cf:ename, timestamp=1514476352028, value=SMITH
- 7369 column=cf:hire_date, timestamp=1514476352028, value=12/17/1980
- 7369 column=cf:manager, timestamp=1514476352028, value=7902
- 7369 column=cf:sal, timestamp=1514476352028, value=800
- 7499 column=cf:deptno, timestamp=1514476352028, value=30
- 7499 column=cf:designation, timestamp=1514476352028, value=SALESMAN
- 7499 column=cf:ename, timestamp=1514476352028, value=ALLEN
- 7499 column=cf:hire_date, timestamp=1514476352028, value=2/20/1981
- 7499 column=cf:manager, timestamp=1514476352028, value=7698
- 7499 column=cf:sal, timestamp=1514476352028, value=1600
- 7521 column=cf:deptno, timestamp=1514476352028, value=30
- 7521 column=cf:designation, timestamp=1514476352028, value=SALESMAN
- 7521 column=cf:ename, timestamp=1514476352028, value=WARD
- 7521 column=cf:hire_date, timestamp=1514476352028, value=2/22/1981
- 7521 column=cf:manager, timestamp=1514476352028, value=7698
- 7521 column=cf:sal, timestamp=1514476352028, value=1250
- 7566 column=cf:deptno, timestamp=1514476352028, value=20
- 7566 column=cf:designation, timestamp=1514476352028, value=MANAGER
- 7566 column=cf:ename, timestamp=1514476352028, value=TURNER
- 7566 column=cf:hire_date, timestamp=1514476352028, value=4/2/1981
- 7566 column=cf:manager, timestamp=1514476352028, value=7839
- 7566 column=cf:sal, timestamp=1514476352028, value=2975
- 7654 column=cf:deptno, timestamp=1514476352028, value=30
- 7654 column=cf:designation, timestamp=1514476352028, value=SALESMAN
- 7654 column=cf:ename, timestamp=1514476352028, value=MARTIN
- 7654 column=cf:hire_date, timestamp=1514476352028, value=9/28/1981
- 7654 column=cf:manager, timestamp=1514476352028, value=7698
- 7654 column=cf:sal, timestamp=1514476352028, value=1250
- 7698 column=cf:deptno, timestamp=1514476352028, value=30
- 7698 column=cf:designation, timestamp=1514476352028, value=MANAGER
- 7698 column=cf:ename, timestamp=1514476352028, value=MILLER
- 7698 column=cf:hire_date, timestamp=1514476352028, value=5/1/1981
- 7698 column=cf:manager, timestamp=1514476352028, value=7839
- 7698 column=cf:sal, timestamp=1514476352028, value=2850
- 7782 column=cf:deptno, timestamp=1514476352028, value=10
- 7782 column=cf:designation, timestamp=1514476352028, value=MANAGER
- 7782 column=cf:ename, timestamp=1514476352028, value=CLARK
- 7782 column=cf:hire_date, timestamp=1514476352028, value=6/9/1981
- 7782 column=cf:manager, timestamp=1514476352028, value=7839
- 7782 column=cf:sal, timestamp=1514476352028, value=2450
- 7788 column=cf:deptno, timestamp=1514476352028, value=20
- 7788 column=cf:designation, timestamp=1514476352028, value=ANALYST
- 7788 column=cf:ename, timestamp=1514476352028, value=SCOTT
- 7788 column=cf:hire_date, timestamp=1514476352028, value=12/9/1982
- 7788 column=cf:manager, timestamp=1514476352028, value=7566
- 7788 column=cf:sal, timestamp=1514476352028, value=3000
- 7839 column=cf:deptno, timestamp=1514476352028, value=10
- 7839 column=cf:designation, timestamp=1514476352028, value=PRESIDENT
- 7839 column=cf:ename, timestamp=1514476352028, value=KING
- 7839 column=cf:hire_date, timestamp=1514476352028, value=11/17/1981
- 7839 column=cf:sal, timestamp=1514476352028, value=5000
- 7844 column=cf:deptno, timestamp=1514476352028, value=30
- 7844 column=cf:designation, timestamp=1514476352028, value=SALESMAN
- 7844 column=cf:ename, timestamp=1514476352028, value=TURNER
- 7844 column=cf:hire_date, timestamp=1514476352028, value=9/8/1981
- 7844 column=cf:manager, timestamp=1514476352028, value=7698
- 7844 column=cf:sal, timestamp=1514476352028, value=1500
- 7876 column=cf:deptno, timestamp=1514476352028, value=20
- 7876 column=cf:designation, timestamp=1514476352028, value=CLERK
- 7876 column=cf:ename, timestamp=1514476352028, value=ADAMS
- 7876 column=cf:hire_date, timestamp=1514476352028, value=1/12/1983
- 7876 column=cf:manager, timestamp=1514476352028, value=7788
- 7876 column=cf:sal, timestamp=1514476352028, value=1100
- 7900 column=cf:deptno, timestamp=1514476352028, value=30
- 7900 column=cf:designation, timestamp=1514476352028, value=CLERK
- 7900 column=cf:ename, timestamp=1514476352028, value=JAMES
- 7900 column=cf:hire_date, timestamp=1514476352028, value=12/3/1981
- 7900 column=cf:manager, timestamp=1514476352028, value=7698
- 7900 column=cf:sal, timestamp=1514476352028, value=950
- 7902 column=cf:deptno, timestamp=1514476352028, value=20
- 7902 column=cf:designation, timestamp=1514476352028, value=ANALYST
- 7902 column=cf:ename, timestamp=1514476352028, value=FORD
- 7902 column=cf:hire_date, timestamp=1514476352028, value=12/3/1981
- 7902 column=cf:manager, timestamp=1514476352028, value=7566
- 7902 column=cf:sal, timestamp=1514476352028, value=3000
- 7934 column=cf:deptno, timestamp=1514476352028, value=10
- 7934 column=cf:designation, timestamp=1514476352028, value=CLERK
- 7934 column=cf:ename, timestamp=1514476352028, value=MILLER
- 7934 column=cf:hire_date, timestamp=1514476352028, value=1/23/1982
- 7934 column=cf:manager, timestamp=1514476352028, value=7782
- 7934 column=cf:sal, timestamp=1514476352028, value=1300
- 14 row(s) in 1.7580 seconds
Wrapping Up
In this post, we have created a hive to hbase mapping table in order to migrate data from hive to hbase. There is an HBase table on top of our Hive table. If your hive table contains a record which has NULL values for all the columns, in that case, hive and hbase records count would differ. The reason is, HBase table will ignore that record.
Join in hive with example
Read More
Join in pyspark with example
Read More
Join in spark using scala with example
Read More