Import CSV data into HBase

Requirement

In this post, we are having data in a CSV file. This file contains basic information about Employees. We want to import CSV data into HBase table.

Components Involved

  • HDFS
  • HBASE

Sample Data

Our sample data looks like below:

7369,SMITH,CLERK,7902,12/17/1980,800,20
7499,ALLEN,SALESMAN,7698,2/20/1981,1600,30
7521,WARD,SALESMAN,7698,2/22/1981,1250,30
7566,TURNER,MANAGER,7839,4/2/1981,2975,20
7654,MARTIN,SALESMAN,7698,9/28/1981,1250,30
7698,MILLER,MANAGER,7839,5/1/1981,2850,30
7782,CLARK,MANAGER,7839,6/9/1981,2450,10
7788,SCOTT,ANALYST,7566,12/9/1982,3000,20
7839,KING,PRESIDENT,NULL,11/17/1981,5000,10
7844,TURNER,SALESMAN,7698,9/8/1981,1500,30
7876,ADAMS,CLERK,7788,1/12/1983,1100,20
7900,JAMES,CLERK,7698,12/3/1981,950,30
7902,FORD,ANALYST,7566,12/3/1981,3000,20
7934,MILLER,CLERK,7782,1/23/1982,1300,10

You can download the sample data from this link emp_data.

Solutions

We will go step-by-step in order to achieve our requirement with the given input.

Step 1: Input Data Preparation

Once we have the data in CSV format, we have to store it at a path from where it can get access by HBase that means we will keep input data in HDFS location. Currently, I am having a data file in local path, we will copy this to HDFS location using the command. In my case, I am copying file from local ‘/root/bdp/hbase/data‘ to HDFS ‘/user/bdp/hbase/data‘ Path.

 hadoop fs -copyFromLocal /root/bdp/hbase/data/*.csv /user/bdp/hbase/data

Step 2: Create HBase Table

In this step, we will create an HBase table to store the data. Let’s create this table based on the data we have in CSV file. Open HBase console using HBase shell and execute the query:

 create 'emp_data',{NAME => 'cf'}

Here, we have created an HBase table named emp_data with column family cf. All the columns of the CSV data will be store in cf family.

Step 3: Load data into HBase

We are done with data preparation and table creation. Now we have to load the data from HDFS to HBase. We will use below query:

cd /usr/hdp/2.4.0.0-169/hbase

bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=’,’ -Dimporttsv.columns=’HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno’ emp_data /user/bdp/hbase/data/emp_data.csv

Once we submit the job, a MapReduce job will get started. Let’s understand each argument in more details:

-Dimporttsv.separator – Specify the delimiter of the source file

-Dimporttsv.columns – Mentioned columns name. Here if you observe, we have not mentioned empno. The Rowkey will have empno value. The row key needs to be identified using the all-caps HBASE_ROW_KEY string; otherwise, it won’t start the job.

Step 4: Read data from HBase

Let’s check the data in HBase table.

hbase(main):043:0> scan 'emp_data'
ROW                             COLUMN+CELL
 7369                           column=cf:deptno, timestamp=1514460522175, value=20
 7369                           column=cf:designation, timestamp=1514460522175, value=CLERK
 7369                           column=cf:ename, timestamp=1514460522175, value=SMITH
 7369                           column=cf:hire_date, timestamp=1514460522175, value=12/17/1980
 7369                           column=cf:manager, timestamp=1514460522175, value=7902
 7369                           column=cf:sal, timestamp=1514460522175, value=800
 7499                           column=cf:deptno, timestamp=1514460522175, value=30
 7499                           column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7499                           column=cf:ename, timestamp=1514460522175, value=ALLEN
 7499                           column=cf:hire_date, timestamp=1514460522175, value=2/20/1981
 7499                           column=cf:manager, timestamp=1514460522175, value=7698
 7499                           column=cf:sal, timestamp=1514460522175, value=1600
 7521                           column=cf:deptno, timestamp=1514460522175, value=30
 7521                           column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7521                           column=cf:ename, timestamp=1514460522175, value=WARD
 7521                           column=cf:hire_date, timestamp=1514460522175, value=2/22/1981
 7521                           column=cf:manager, timestamp=1514460522175, value=7698
 7521                           column=cf:sal, timestamp=1514460522175, value=1250
 7566                           column=cf:deptno, timestamp=1514460522175, value=20
 7566                           column=cf:designation, timestamp=1514460522175, value=MANAGER
 7566                           column=cf:ename, timestamp=1514460522175, value=TURNER
 7566                           column=cf:hire_date, timestamp=1514460522175, value=4/2/1981
 7566                           column=cf:manager, timestamp=1514460522175, value=7839
 7566                           column=cf:sal, timestamp=1514460522175, value=2975
 7654                           column=cf:deptno, timestamp=1514460522175, value=30
 7654                           column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7654                           column=cf:ename, timestamp=1514460522175, value=MARTIN
 7654                           column=cf:hire_date, timestamp=1514460522175, value=9/28/1981
 7654                           column=cf:manager, timestamp=1514460522175, value=7698
 7654                           column=cf:sal, timestamp=1514460522175, value=1250
 7698                           column=cf:deptno, timestamp=1514460522175, value=30
 7698                           column=cf:designation, timestamp=1514460522175, value=MANAGER
 7698                           column=cf:ename, timestamp=1514460522175, value=MILLER
 7698                           column=cf:hire_date, timestamp=1514460522175, value=5/1/1981
 7698                           column=cf:manager, timestamp=1514460522175, value=7839
 7698                           column=cf:sal, timestamp=1514460522175, value=2850
 7782                           column=cf:deptno, timestamp=1514460522175, value=10
 7782                           column=cf:designation, timestamp=1514460522175, value=MANAGER
 7782                           column=cf:ename, timestamp=1514460522175, value=CLARK
 7782                           column=cf:hire_date, timestamp=1514460522175, value=6/9/1981
 7782                           column=cf:manager, timestamp=1514460522175, value=7839
 7782                           column=cf:sal, timestamp=1514460522175, value=2450
 7788                           column=cf:deptno, timestamp=1514460522175, value=20
 7788                           column=cf:designation, timestamp=1514460522175, value=ANALYST
 7788                           column=cf:ename, timestamp=1514460522175, value=SCOTT
 7788                           column=cf:hire_date, timestamp=1514460522175, value=12/9/1982
 7788                           column=cf:manager, timestamp=1514460522175, value=7566
 7788                           column=cf:sal, timestamp=1514460522175, value=3000
 7839                           column=cf:deptno, timestamp=1514460522175, value=10
 7839                           column=cf:designation, timestamp=1514460522175, value=PRESIDENT
 7839                           column=cf:ename, timestamp=1514460522175, value=KING
 7839                           column=cf:hire_date, timestamp=1514460522175, value=11/17/1981
 7839                           column=cf:manager, timestamp=1514460522175, value=NULL
 7839                           column=cf:sal, timestamp=1514460522175, value=5000
 7844                           column=cf:deptno, timestamp=1514460522175, value=30
 7844                           column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7844                           column=cf:ename, timestamp=1514460522175, value=TURNER
 7844                           column=cf:hire_date, timestamp=1514460522175, value=9/8/1981
 7844                           column=cf:manager, timestamp=1514460522175, value=7698
 7844                           column=cf:sal, timestamp=1514460522175, value=1500
 7876                           column=cf:deptno, timestamp=1514460522175, value=20
 7876                           column=cf:designation, timestamp=1514460522175, value=CLERK
 7876                           column=cf:ename, timestamp=1514460522175, value=ADAMS
 7876                           column=cf:hire_date, timestamp=1514460522175, value=1/12/1983
 7876                           column=cf:manager, timestamp=1514460522175, value=7788
 7876                           column=cf:sal, timestamp=1514460522175, value=1100
 7900                           column=cf:deptno, timestamp=1514460522175, value=30
 7900                           column=cf:designation, timestamp=1514460522175, value=CLERK
 7900                           column=cf:ename, timestamp=1514460522175, value=JAMES
 7900                           column=cf:hire_date, timestamp=1514460522175, value=12/3/1981
 7900                           column=cf:manager, timestamp=1514460522175, value=7698
 7900                           column=cf:sal, timestamp=1514460522175, value=950
 7902                           column=cf:deptno, timestamp=1514460522175, value=20
 7902                           column=cf:designation, timestamp=1514460522175, value=ANALYST
 7902                           column=cf:ename, timestamp=1514460522175, value=FORD
 7902                           column=cf:hire_date, timestamp=1514460522175, value=12/3/1981
 7902                           column=cf:manager, timestamp=1514460522175, value=7566
 7902                           column=cf:sal, timestamp=1514460522175, value=3000
 7934                           column=cf:deptno, timestamp=1514460522175, value=10
 7934                           column=cf:designation, timestamp=1514460522175, value=CLERK
 7934                           column=cf:ename, timestamp=1514460522175, value=MILLER
 7934                           column=cf:hire_date, timestamp=1514460522175, value=1/23/1982
 7934                           column=cf:manager, timestamp=1514460522175, value=7782
 7934                           column=cf:sal, timestamp=1514460522175, value=1300
14 row(s) in 1.0290 seconds

Wrapping Up

In this post, we have seen the approach to load the data, which is in CSV format, to HBase table. We have used ImportTSV for the task. The advantage of using HBase for storage is fast operation as compared to other components.

Sharing is caring!

Subscribe to our newsletter
Loading

1 Comment

  1. Ozi Cik

    Hallo, my name is ozie. Can you help me? I have a problem when importing data into hbase table. I’ve tried to use importtsv, but the problem is the number of columns in my file very much (1000 columns). Do I have to write all the columns or is there another way that can automatically increase the number of columns according to the file?

    Thank you…

Leave a Reply