Import CSV data into HBase

Import CSV data into HBase

Requirement

In this post, we are having data in a CSV file. This file contains basic information about Employees. We want to import CSV data into HBase table.

Components Involved

  • HDFS
  • HBASE

Sample Data

Our sample data looks like below:

sample-data
 
  1. 7369,SMITH,CLERK,7902,12/17/1980,800,20
  2. 7499,ALLEN,SALESMAN,7698,2/20/1981,1600,30
  3. 7521,WARD,SALESMAN,7698,2/22/1981,1250,30
  4. 7566,TURNER,MANAGER,7839,4/2/1981,2975,20
  5. 7654,MARTIN,SALESMAN,7698,9/28/1981,1250,30
  6. 7698,MILLER,MANAGER,7839,5/1/1981,2850,30
  7. 7782,CLARK,MANAGER,7839,6/9/1981,2450,10
  8. 7788,SCOTT,ANALYST,7566,12/9/1982,3000,20
  9. 7839,KING,PRESIDENT,NULL,11/17/1981,5000,10
  10. 7844,TURNER,SALESMAN,7698,9/8/1981,1500,30
  11. 7876,ADAMS,CLERK,7788,1/12/1983,1100,20
  12. 7900,JAMES,CLERK,7698,12/3/1981,950,30
  13. 7902,FORD,ANALYST,7566,12/3/1981,3000,20
  14. 7934,MILLER,CLERK,7782,1/23/1982,1300,10

You can download the sample data from this link emp_data.

Solutions

We will go step-by-step in order to achieve our requirement with the given input.

Step 1: Input Data Preparation

Once we have the data in CSV format, we have to store it at a path from where it can get access by HBase that means we will keep input data in HDFS location. Currently, I am having a data file in local path, we will copy this to HDFS location using the command. In my case, I am copying file from local ‘/root/bdp/hbase/data‘ to HDFS ‘/user/bdp/hbase/data‘ Path.

copy
 
hadoop fs -copyFromLocal /root/bdp/hbase/data/*.csv /user/bdp/hbase/data

Step 2: Create HBase Table

In this step, we will create an HBase table to store the data. Let’s create this table based on the data we have in CSV file. Open HBase console using HBase shell and execute the query:

create hbase table
 
  1. create 'emp_data',{NAME => 'cf'}

Here, we have created an HBase table named emp_data with column family cf. All the columns of the CSV data will be store in cf family.

Step 3: Load data into HBase

We are done with data preparation and table creation. Now we have to load the data from HDFS to HBase. We will use below query:

cd /usr/hdp/2.4.0.0-169/hbase

bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=’,’ -Dimporttsv.columns=’HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno’ emp_data /user/bdp/hbase/data/emp_data.csv

Once we submit the job, a MapReduce job will get started. Let’s understand each argument in more details:

-Dimporttsv.separator – Specify the delimiter of the source file

-Dimporttsv.columns – Mentioned columns name. Here if you observe, we have not mentioned empno. The Rowkey will have empno value. The row key needs to be identified using the all-caps HBASE_ROW_KEY string; otherwise, it won’t start the job.

Step 4: Read data from HBase

Let’s check the data in HBase table.

 
 
  1. hbase(main):043:0> scan 'emp_data'
  2. ROW COLUMN+CELL
  3. 7369 column=cf:deptno, timestamp=1514460522175, value=20
  4. 7369 column=cf:designation, timestamp=1514460522175, value=CLERK
  5. 7369 column=cf:ename, timestamp=1514460522175, value=SMITH
  6. 7369 column=cf:hire_date, timestamp=1514460522175, value=12/17/1980
  7. 7369 column=cf:manager, timestamp=1514460522175, value=7902
  8. 7369 column=cf:sal, timestamp=1514460522175, value=800
  9. 7499 column=cf:deptno, timestamp=1514460522175, value=30
  10. 7499 column=cf:designation, timestamp=1514460522175, value=SALESMAN
  11. 7499 column=cf:ename, timestamp=1514460522175, value=ALLEN
  12. 7499 column=cf:hire_date, timestamp=1514460522175, value=2/20/1981
  13. 7499 column=cf:manager, timestamp=1514460522175, value=7698
  14. 7499 column=cf:sal, timestamp=1514460522175, value=1600
  15. 7521 column=cf:deptno, timestamp=1514460522175, value=30
  16. 7521 column=cf:designation, timestamp=1514460522175, value=SALESMAN
  17. 7521 column=cf:ename, timestamp=1514460522175, value=WARD
  18. 7521 column=cf:hire_date, timestamp=1514460522175, value=2/22/1981
  19. 7521 column=cf:manager, timestamp=1514460522175, value=7698
  20. 7521 column=cf:sal, timestamp=1514460522175, value=1250
  21. 7566 column=cf:deptno, timestamp=1514460522175, value=20
  22. 7566 column=cf:designation, timestamp=1514460522175, value=MANAGER
  23. 7566 column=cf:ename, timestamp=1514460522175, value=TURNER
  24. 7566 column=cf:hire_date, timestamp=1514460522175, value=4/2/1981
  25. 7566 column=cf:manager, timestamp=1514460522175, value=7839
  26. 7566 column=cf:sal, timestamp=1514460522175, value=2975
  27. 7654 column=cf:deptno, timestamp=1514460522175, value=30
  28. 7654 column=cf:designation, timestamp=1514460522175, value=SALESMAN
  29. 7654 column=cf:ename, timestamp=1514460522175, value=MARTIN
  30. 7654 column=cf:hire_date, timestamp=1514460522175, value=9/28/1981
  31. 7654 column=cf:manager, timestamp=1514460522175, value=7698
  32. 7654 column=cf:sal, timestamp=1514460522175, value=1250
  33. 7698 column=cf:deptno, timestamp=1514460522175, value=30
  34. 7698 column=cf:designation, timestamp=1514460522175, value=MANAGER
  35. 7698 column=cf:ename, timestamp=1514460522175, value=MILLER
  36. 7698 column=cf:hire_date, timestamp=1514460522175, value=5/1/1981
  37. 7698 column=cf:manager, timestamp=1514460522175, value=7839
  38. 7698 column=cf:sal, timestamp=1514460522175, value=2850
  39. 7782 column=cf:deptno, timestamp=1514460522175, value=10
  40. 7782 column=cf:designation, timestamp=1514460522175, value=MANAGER
  41. 7782 column=cf:ename, timestamp=1514460522175, value=CLARK
  42. 7782 column=cf:hire_date, timestamp=1514460522175, value=6/9/1981
  43. 7782 column=cf:manager, timestamp=1514460522175, value=7839
  44. 7782 column=cf:sal, timestamp=1514460522175, value=2450
  45. 7788 column=cf:deptno, timestamp=1514460522175, value=20
  46. 7788 column=cf:designation, timestamp=1514460522175, value=ANALYST
  47. 7788 column=cf:ename, timestamp=1514460522175, value=SCOTT
  48. 7788 column=cf:hire_date, timestamp=1514460522175, value=12/9/1982
  49. 7788 column=cf:manager, timestamp=1514460522175, value=7566
  50. 7788 column=cf:sal, timestamp=1514460522175, value=3000
  51. 7839 column=cf:deptno, timestamp=1514460522175, value=10
  52. 7839 column=cf:designation, timestamp=1514460522175, value=PRESIDENT
  53. 7839 column=cf:ename, timestamp=1514460522175, value=KING
  54. 7839 column=cf:hire_date, timestamp=1514460522175, value=11/17/1981
  55. 7839 column=cf:manager, timestamp=1514460522175, value=NULL
  56. 7839 column=cf:sal, timestamp=1514460522175, value=5000
  57. 7844 column=cf:deptno, timestamp=1514460522175, value=30
  58. 7844 column=cf:designation, timestamp=1514460522175, value=SALESMAN
  59. 7844 column=cf:ename, timestamp=1514460522175, value=TURNER
  60. 7844 column=cf:hire_date, timestamp=1514460522175, value=9/8/1981
  61. 7844 column=cf:manager, timestamp=1514460522175, value=7698
  62. 7844 column=cf:sal, timestamp=1514460522175, value=1500
  63. 7876 column=cf:deptno, timestamp=1514460522175, value=20
  64. 7876 column=cf:designation, timestamp=1514460522175, value=CLERK
  65. 7876 column=cf:ename, timestamp=1514460522175, value=ADAMS
  66. 7876 column=cf:hire_date, timestamp=1514460522175, value=1/12/1983
  67. 7876 column=cf:manager, timestamp=1514460522175, value=7788
  68. 7876 column=cf:sal, timestamp=1514460522175, value=1100
  69. 7900 column=cf:deptno, timestamp=1514460522175, value=30
  70. 7900 column=cf:designation, timestamp=1514460522175, value=CLERK
  71. 7900 column=cf:ename, timestamp=1514460522175, value=JAMES
  72. 7900 column=cf:hire_date, timestamp=1514460522175, value=12/3/1981
  73. 7900 column=cf:manager, timestamp=1514460522175, value=7698
  74. 7900 column=cf:sal, timestamp=1514460522175, value=950
  75. 7902 column=cf:deptno, timestamp=1514460522175, value=20
  76. 7902 column=cf:designation, timestamp=1514460522175, value=ANALYST
  77. 7902 column=cf:ename, timestamp=1514460522175, value=FORD
  78. 7902 column=cf:hire_date, timestamp=1514460522175, value=12/3/1981
  79. 7902 column=cf:manager, timestamp=1514460522175, value=7566
  80. 7902 column=cf:sal, timestamp=1514460522175, value=3000
  81. 7934 column=cf:deptno, timestamp=1514460522175, value=10
  82. 7934 column=cf:designation, timestamp=1514460522175, value=CLERK
  83. 7934 column=cf:ename, timestamp=1514460522175, value=MILLER
  84. 7934 column=cf:hire_date, timestamp=1514460522175, value=1/23/1982
  85. 7934 column=cf:manager, timestamp=1514460522175, value=7782
  86. 7934 column=cf:sal, timestamp=1514460522175, value=1300
  87. 14 row(s) in 1.0290 seconds

Wrapping Up

In this post, we have seen the approach to load the data, which is in CSV format, to HBase table. We have used ImportTSV for the task. The advantage of using HBase for storage is fast operation as compared to other components.

3
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

1 Comment

  1. Ozi Cik

    Hallo, my name is ozie. Can you help me? I have a problem when importing data into hbase table. I’ve tried to use importtsv, but the problem is the number of columns in my file very much (1000 columns). Do I have to write all the columns or is there another way that can automatically increase the number of columns according to the file?

    Thank you…

    0

    0

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.