Data migration from Hive to HBase

Data migration from Hive to HBase

Requirement

Suppose we have data in Hive table. We want the same data into HBase table. So, our requirement is to migrate the data from Hive to HBase table.

Components Involved

  • Hive – Source table
  • HBase – Target Table

Solution

We cannot load data directly into HBase table from the hive. In order to achieve the requirement, we have to go through the following steps:

Step 1: Create Hive table

If you already have hive table with data then jump to step 3.

We are creating this hive table as a source. This table data, we want in HBase table.

create-table
 
CREATE TABLE hive_table(
  empno int,
  ename string,
  designation string,
  manager int,
  sal int,
  deptno int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Step 2: Load data into Hive

Loading the data from the local path. In my case, the local path is /root/bdp/hbase/data/emp_data.csv.

load data
 
LOAD DATA LOCAL INPATH '/root/bdp/hbase/data/emp_data.csv' INTO TABLE hive_table;

Step 3: Create HBase-Hive Mapping table

In this step, we are creating another hive table which actually points to an HBase table.

hbase-hive--mapping
 
CREATE TABLE hbase_table_employee 
  ( 
     empno       INT, 
     ename       STRING, 
     designation STRING, 
     manager     INT, 
     hire_date   STRING, 
     sal         INT, 
     deptno      INT 
  )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno")
TBLPROPERTIES ("hbase.table.name" = "employee_hbase");

Here, we are specifying HBaseStorageHandler in Stored By option. Also, mapping the hive column with HBase column family using SERDEPROPERTIES. It will create an HBase table named employee_hbase which will point to this hive table.

Note: Key represents the first column of the hive table. So Id will become the key.

In case, you are already having an HBase table and want to load data into existing HBase table, then you have to use EXTERNAL in your above hive DDL. Otherwise, you will get an error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:MetaException(message:Table employee_hbase_2 already exists within HBase; use CREATE EXTERNAL TABLE instead to register it in Hive.)

 
 
CREATE EXTERNAL TABLE hbase_table_employee_2
  ( 
     empno       INT, 
     ename       STRING, 
     designation STRING, 
     manager     INT, 
     hire_date   STRING, 
     sal         INT, 
     deptno      INT 
  )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno")
TBLPROPERTIES ("hbase.table.name" = "employee_hbase_2");

Step 4: Load data into HBase from Hive

In this step, we are going to migrate hive table data to HBase. That means we will load the hive (created in step 1) data to hive table created in step 3.

 
 
INSERT INTO TABLE hbase_table_employee SELECT * FROM hive_table;

We have loaded data into hbase_table_employee table which is pointing to HBase table employee_hbase.

Step 5: Scan HBase Table

Let’s check the data in HBase table:

 
 
  1. hbase(main):008:0> scan 'employee_hbase'
  2. ROW COLUMN+CELL
  3. 7369 column=cf:deptno, timestamp=1514476352028, value=20
  4. 7369 column=cf:designation, timestamp=1514476352028, value=CLERK
  5. 7369 column=cf:ename, timestamp=1514476352028, value=SMITH
  6. 7369 column=cf:hire_date, timestamp=1514476352028, value=12/17/1980
  7. 7369 column=cf:manager, timestamp=1514476352028, value=7902
  8. 7369 column=cf:sal, timestamp=1514476352028, value=800
  9. 7499 column=cf:deptno, timestamp=1514476352028, value=30
  10. 7499 column=cf:designation, timestamp=1514476352028, value=SALESMAN
  11. 7499 column=cf:ename, timestamp=1514476352028, value=ALLEN
  12. 7499 column=cf:hire_date, timestamp=1514476352028, value=2/20/1981
  13. 7499 column=cf:manager, timestamp=1514476352028, value=7698
  14. 7499 column=cf:sal, timestamp=1514476352028, value=1600
  15. 7521 column=cf:deptno, timestamp=1514476352028, value=30
  16. 7521 column=cf:designation, timestamp=1514476352028, value=SALESMAN
  17. 7521 column=cf:ename, timestamp=1514476352028, value=WARD
  18. 7521 column=cf:hire_date, timestamp=1514476352028, value=2/22/1981
  19. 7521 column=cf:manager, timestamp=1514476352028, value=7698
  20. 7521 column=cf:sal, timestamp=1514476352028, value=1250
  21. 7566 column=cf:deptno, timestamp=1514476352028, value=20
  22. 7566 column=cf:designation, timestamp=1514476352028, value=MANAGER
  23. 7566 column=cf:ename, timestamp=1514476352028, value=TURNER
  24. 7566 column=cf:hire_date, timestamp=1514476352028, value=4/2/1981
  25. 7566 column=cf:manager, timestamp=1514476352028, value=7839
  26. 7566 column=cf:sal, timestamp=1514476352028, value=2975
  27. 7654 column=cf:deptno, timestamp=1514476352028, value=30
  28. 7654 column=cf:designation, timestamp=1514476352028, value=SALESMAN
  29. 7654 column=cf:ename, timestamp=1514476352028, value=MARTIN
  30. 7654 column=cf:hire_date, timestamp=1514476352028, value=9/28/1981
  31. 7654 column=cf:manager, timestamp=1514476352028, value=7698
  32. 7654 column=cf:sal, timestamp=1514476352028, value=1250
  33. 7698 column=cf:deptno, timestamp=1514476352028, value=30
  34. 7698 column=cf:designation, timestamp=1514476352028, value=MANAGER
  35. 7698 column=cf:ename, timestamp=1514476352028, value=MILLER
  36. 7698 column=cf:hire_date, timestamp=1514476352028, value=5/1/1981
  37. 7698 column=cf:manager, timestamp=1514476352028, value=7839
  38. 7698 column=cf:sal, timestamp=1514476352028, value=2850
  39. 7782 column=cf:deptno, timestamp=1514476352028, value=10
  40. 7782 column=cf:designation, timestamp=1514476352028, value=MANAGER
  41. 7782 column=cf:ename, timestamp=1514476352028, value=CLARK
  42. 7782 column=cf:hire_date, timestamp=1514476352028, value=6/9/1981
  43. 7782 column=cf:manager, timestamp=1514476352028, value=7839
  44. 7782 column=cf:sal, timestamp=1514476352028, value=2450
  45. 7788 column=cf:deptno, timestamp=1514476352028, value=20
  46. 7788 column=cf:designation, timestamp=1514476352028, value=ANALYST
  47. 7788 column=cf:ename, timestamp=1514476352028, value=SCOTT
  48. 7788 column=cf:hire_date, timestamp=1514476352028, value=12/9/1982
  49. 7788 column=cf:manager, timestamp=1514476352028, value=7566
  50. 7788 column=cf:sal, timestamp=1514476352028, value=3000
  51. 7839 column=cf:deptno, timestamp=1514476352028, value=10
  52. 7839 column=cf:designation, timestamp=1514476352028, value=PRESIDENT
  53. 7839 column=cf:ename, timestamp=1514476352028, value=KING
  54. 7839 column=cf:hire_date, timestamp=1514476352028, value=11/17/1981
  55. 7839 column=cf:sal, timestamp=1514476352028, value=5000
  56. 7844 column=cf:deptno, timestamp=1514476352028, value=30
  57. 7844 column=cf:designation, timestamp=1514476352028, value=SALESMAN
  58. 7844 column=cf:ename, timestamp=1514476352028, value=TURNER
  59. 7844 column=cf:hire_date, timestamp=1514476352028, value=9/8/1981
  60. 7844 column=cf:manager, timestamp=1514476352028, value=7698
  61. 7844 column=cf:sal, timestamp=1514476352028, value=1500
  62. 7876 column=cf:deptno, timestamp=1514476352028, value=20
  63. 7876 column=cf:designation, timestamp=1514476352028, value=CLERK
  64. 7876 column=cf:ename, timestamp=1514476352028, value=ADAMS
  65. 7876 column=cf:hire_date, timestamp=1514476352028, value=1/12/1983
  66. 7876 column=cf:manager, timestamp=1514476352028, value=7788
  67. 7876 column=cf:sal, timestamp=1514476352028, value=1100
  68. 7900 column=cf:deptno, timestamp=1514476352028, value=30
  69. 7900 column=cf:designation, timestamp=1514476352028, value=CLERK
  70. 7900 column=cf:ename, timestamp=1514476352028, value=JAMES
  71. 7900 column=cf:hire_date, timestamp=1514476352028, value=12/3/1981
  72. 7900 column=cf:manager, timestamp=1514476352028, value=7698
  73. 7900 column=cf:sal, timestamp=1514476352028, value=950
  74. 7902 column=cf:deptno, timestamp=1514476352028, value=20
  75. 7902 column=cf:designation, timestamp=1514476352028, value=ANALYST
  76. 7902 column=cf:ename, timestamp=1514476352028, value=FORD
  77. 7902 column=cf:hire_date, timestamp=1514476352028, value=12/3/1981
  78. 7902 column=cf:manager, timestamp=1514476352028, value=7566
  79. 7902 column=cf:sal, timestamp=1514476352028, value=3000
  80. 7934 column=cf:deptno, timestamp=1514476352028, value=10
  81. 7934 column=cf:designation, timestamp=1514476352028, value=CLERK
  82. 7934 column=cf:ename, timestamp=1514476352028, value=MILLER
  83. 7934 column=cf:hire_date, timestamp=1514476352028, value=1/23/1982
  84. 7934 column=cf:manager, timestamp=1514476352028, value=7782
  85. 7934 column=cf:sal, timestamp=1514476352028, value=1300
  86. 14 row(s) in 1.7580 seconds

Wrapping Up

In this post, we have created a hive to hbase mapping table in order to migrate data from hive to hbase. There is an HBase table on top of our Hive table. If your hive table contains a record which has NULL values for all the columns, in that case, hive and hbase records count would differ. The reason is, HBase table will ignore that record.

0
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.