Requirement
In this post, we have data in a CSV file containing basic information about employees, and we want to import this CSV data into an HBase table.
Components Involved
- HDFS
- HBase
Sample Data
Our sample data looks like this:
7369,SMITH,CLERK,7902,12/17/1980,800,20
7499,ALLEN,SALESMAN,7698,2/20/1981,1600,30
7521,WARD,SALESMAN,7698,2/22/1981,1250,30
7566,TURNER,MANAGER,7839,4/2/1981,2975,20
7654,MARTIN,SALESMAN,7698,9/28/1981,1250,30
7698,MILLER,MANAGER,7839,5/1/1981,2850,30
7782,CLARK,MANAGER,7839,6/9/1981,2450,10
7788,SCOTT,ANALYST,7566,12/9/1982,3000,20
7839,KING,PRESIDENT,NULL,11/17/1981,5000,10
7844,TURNER,SALESMAN,7698,9/8/1981,1500,30
7876,ADAMS,CLERK,7788,1/12/1983,1100,20
7900,JAMES,CLERK,7698,12/3/1981,950,30
7902,FORD,ANALYST,7566,12/3/1981,3000,20
7934,MILLER,CLERK,7782,1/23/1982,1300,10
You can download the sample data from this link: emp_data.
Solutions
We will go step by step to achieve our requirement with the given input.
Step 1: Input Data Preparation
Once we have the data in CSV format, we have to store it at a path from which HBase can access it, which means we will keep the input data in an HDFS location. Currently, I have the data file on a local path; we will copy it to HDFS using the command below. In my case, I am copying the file from the local path '/root/bdp/hbase/data' to the HDFS path '/user/bdp/hbase/data'.
hadoop fs -copyFromLocal /root/bdp/hbase/data/*.csv /user/bdp/hbase/data
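If the target directory does not exist in HDFS yet, you can create it first and then verify that the copy succeeded. A minimal sketch, using the same paths as above:

# create the target HDFS directory if it does not exist
hadoop fs -mkdir -p /user/bdp/hbase/data
# copy the CSV file(s) from the local filesystem to HDFS
hadoop fs -copyFromLocal /root/bdp/hbase/data/*.csv /user/bdp/hbase/data
# confirm the file landed in HDFS
hadoop fs -ls /user/bdp/hbase/data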
Step 2: Create HBase Table
In this step, we will create an HBase table to store the data. Let's create this table based on the data we have in the CSV file. Open the HBase shell and execute the following command:
create 'emp_data',{NAME => 'cf'}
Here, we have created an HBase table named emp_data with the column family cf. All the columns of the CSV data will be stored in the cf family.
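Before loading anything, it is worth confirming that the table exists and that the column family is set up as expected. A quick check from the HBase shell:

# list all tables to confirm emp_data was created
list
# show the table's column families and their settings
describe 'emp_data'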
Step 3: Load data into HBase
We are done with data preparation and table creation. Now we have to load the data from HDFS into HBase. We will use the following command:
cd /usr/hdp/2.4.0.0-169/hbase
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno' emp_data /user/bdp/hbase/data/emp_data.csv
Once we submit this command, a MapReduce job will start. Let's understand each argument in more detail:
-Dimporttsv.separator – Specifies the delimiter of the source file.
-Dimporttsv.columns – Specifies the column names in the order in which they appear in the source file. Notice that we have not mentioned empno here; its value becomes the row key. The row key column must be identified with the all-caps HBASE_ROW_KEY string; otherwise, the job won't start.
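As a side note, for very large files ImportTsv can also write HFiles directly instead of issuing individual Puts, and the HFiles are then handed to the region servers in a second step. A minimal sketch, assuming HBase 1.x (where LoadIncrementalHFiles lives under org.apache.hadoop.hbase.mapreduce) and a hypothetical scratch HDFS path /user/bdp/hbase/hfiles:

# step 1: generate HFiles instead of writing through the region servers
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.bulk.output=/user/bdp/hbase/hfiles -Dimporttsv.columns='HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno' emp_data /user/bdp/hbase/data/emp_data.csv
# step 2: hand the generated HFiles over to the emp_data table
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/bdp/hbase/hfiles emp_data

This two-step bulk load avoids the write path entirely, which can be significantly faster for big imports.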
Step 4: Read data from HBase
Let’s check the data in HBase table.
hbase(main):043:0> scan 'emp_data'
ROW     COLUMN+CELL
 7369   column=cf:deptno, timestamp=1514460522175, value=20
 7369   column=cf:designation, timestamp=1514460522175, value=CLERK
 7369   column=cf:ename, timestamp=1514460522175, value=SMITH
 7369   column=cf:hire_date, timestamp=1514460522175, value=12/17/1980
 7369   column=cf:manager, timestamp=1514460522175, value=7902
 7369   column=cf:sal, timestamp=1514460522175, value=800
 7499   column=cf:deptno, timestamp=1514460522175, value=30
 7499   column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7499   column=cf:ename, timestamp=1514460522175, value=ALLEN
 7499   column=cf:hire_date, timestamp=1514460522175, value=2/20/1981
 7499   column=cf:manager, timestamp=1514460522175, value=7698
 7499   column=cf:sal, timestamp=1514460522175, value=1600
 7521   column=cf:deptno, timestamp=1514460522175, value=30
 7521   column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7521   column=cf:ename, timestamp=1514460522175, value=WARD
 7521   column=cf:hire_date, timestamp=1514460522175, value=2/22/1981
 7521   column=cf:manager, timestamp=1514460522175, value=7698
 7521   column=cf:sal, timestamp=1514460522175, value=1250
 7566   column=cf:deptno, timestamp=1514460522175, value=20
 7566   column=cf:designation, timestamp=1514460522175, value=MANAGER
 7566   column=cf:ename, timestamp=1514460522175, value=TURNER
 7566   column=cf:hire_date, timestamp=1514460522175, value=4/2/1981
 7566   column=cf:manager, timestamp=1514460522175, value=7839
 7566   column=cf:sal, timestamp=1514460522175, value=2975
 7654   column=cf:deptno, timestamp=1514460522175, value=30
 7654   column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7654   column=cf:ename, timestamp=1514460522175, value=MARTIN
 7654   column=cf:hire_date, timestamp=1514460522175, value=9/28/1981
 7654   column=cf:manager, timestamp=1514460522175, value=7698
 7654   column=cf:sal, timestamp=1514460522175, value=1250
 7698   column=cf:deptno, timestamp=1514460522175, value=30
 7698   column=cf:designation, timestamp=1514460522175, value=MANAGER
 7698   column=cf:ename, timestamp=1514460522175, value=MILLER
 7698   column=cf:hire_date, timestamp=1514460522175, value=5/1/1981
 7698   column=cf:manager, timestamp=1514460522175, value=7839
 7698   column=cf:sal, timestamp=1514460522175, value=2850
 7782   column=cf:deptno, timestamp=1514460522175, value=10
 7782   column=cf:designation, timestamp=1514460522175, value=MANAGER
 7782   column=cf:ename, timestamp=1514460522175, value=CLARK
 7782   column=cf:hire_date, timestamp=1514460522175, value=6/9/1981
 7782   column=cf:manager, timestamp=1514460522175, value=7839
 7782   column=cf:sal, timestamp=1514460522175, value=2450
 7788   column=cf:deptno, timestamp=1514460522175, value=20
 7788   column=cf:designation, timestamp=1514460522175, value=ANALYST
 7788   column=cf:ename, timestamp=1514460522175, value=SCOTT
 7788   column=cf:hire_date, timestamp=1514460522175, value=12/9/1982
 7788   column=cf:manager, timestamp=1514460522175, value=7566
 7788   column=cf:sal, timestamp=1514460522175, value=3000
 7839   column=cf:deptno, timestamp=1514460522175, value=10
 7839   column=cf:designation, timestamp=1514460522175, value=PRESIDENT
 7839   column=cf:ename, timestamp=1514460522175, value=KING
 7839   column=cf:hire_date, timestamp=1514460522175, value=11/17/1981
 7839   column=cf:manager, timestamp=1514460522175, value=NULL
 7839   column=cf:sal, timestamp=1514460522175, value=5000
 7844   column=cf:deptno, timestamp=1514460522175, value=30
 7844   column=cf:designation, timestamp=1514460522175, value=SALESMAN
 7844   column=cf:ename, timestamp=1514460522175, value=TURNER
 7844   column=cf:hire_date, timestamp=1514460522175, value=9/8/1981
 7844   column=cf:manager, timestamp=1514460522175, value=7698
 7844   column=cf:sal, timestamp=1514460522175, value=1500
 7876   column=cf:deptno, timestamp=1514460522175, value=20
 7876   column=cf:designation, timestamp=1514460522175, value=CLERK
 7876   column=cf:ename, timestamp=1514460522175, value=ADAMS
 7876   column=cf:hire_date, timestamp=1514460522175, value=1/12/1983
 7876   column=cf:manager, timestamp=1514460522175, value=7788
 7876   column=cf:sal, timestamp=1514460522175, value=1100
 7900   column=cf:deptno, timestamp=1514460522175, value=30
 7900   column=cf:designation, timestamp=1514460522175, value=CLERK
 7900   column=cf:ename, timestamp=1514460522175, value=JAMES
 7900   column=cf:hire_date, timestamp=1514460522175, value=12/3/1981
 7900   column=cf:manager, timestamp=1514460522175, value=7698
 7900   column=cf:sal, timestamp=1514460522175, value=950
 7902   column=cf:deptno, timestamp=1514460522175, value=20
 7902   column=cf:designation, timestamp=1514460522175, value=ANALYST
 7902   column=cf:ename, timestamp=1514460522175, value=FORD
 7902   column=cf:hire_date, timestamp=1514460522175, value=12/3/1981
 7902   column=cf:manager, timestamp=1514460522175, value=7566
 7902   column=cf:sal, timestamp=1514460522175, value=3000
 7934   column=cf:deptno, timestamp=1514460522175, value=10
 7934   column=cf:designation, timestamp=1514460522175, value=CLERK
 7934   column=cf:ename, timestamp=1514460522175, value=MILLER
 7934   column=cf:hire_date, timestamp=1514460522175, value=1/23/1982
 7934   column=cf:manager, timestamp=1514460522175, value=7782
 7934   column=cf:sal, timestamp=1514460522175, value=1300
14 row(s) in 1.0290 seconds
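If you only need one record rather than the full table, a get on the row key is cheaper than a scan. For example, to look up employee 7369:

# fetch all cells of a single row by its row key
get 'emp_data', '7369'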
Wrapping Up
In this post, we have seen an approach to load data that is in CSV format into an HBase table, using the ImportTsv utility for the task. The advantage of using HBase for storage is fast random read and write access to the data, compared to working with raw files on HDFS.