Requirement
If you have a comma-separated (CSV) file and you want to create an Avro table in Hive on top of it, follow the steps below.
Solution
Step 1: Sample CSV File
Create a sample CSV file named sample_1.csv.
Download it from here: sample_1
(You can skip this step if you already have a CSV file; just place it in a local directory.)
Put content in the file, delimited by commas (,). If you created the file on Windows, transfer it to your Linux machine via WinSCP.
Please refer to the screenshot below.
I have a local directory named input_files, and I have placed the sample_1.csv file in that directory. You can see the content of that file using the below command:
cat /root/bigdataprogrammers/input_files/sample_1.csv
It will show the content of the file.
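The exact contents of the sample file are not reproduced here, but any rows with two comma-separated columns matching the (id, code) layout used in the next steps will work, for example (illustrative values only):
1,AAA
2,BBB
3,CCC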
Step 2: Copy CSV to HDFS
Run the below commands in the shell for the initial setup.
First, create an HDFS directory named ld_csv_hv and an ip directory inside it using the below commands:
hadoop fs -mkdir bdp/ld_csv_hv
hadoop fs -mkdir bdp/ld_csv_hv/ip
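If your Hadoop version supports it, the -p flag creates missing parent directories as well, so both levels can be created in a single command:
hadoop fs -mkdir -p bdp/ld_csv_hv/ip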
Put the file into HDFS using the below command:
hadoop fs -put /root/bigdataprogrammers/input_files/sample_1.csv bdp/ld_csv_hv/ip/
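Equivalently, if you prefer the hdfs dfs form, -copyFromLocal does the same thing:
hdfs dfs -copyFromLocal /root/bigdataprogrammers/input_files/sample_1.csv bdp/ld_csv_hv/ip/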
Check whether the file is available in HDFS using the below command:
hadoop fs -ls bdp/ld_csv_hv/ip/
NOTE: For me, the default HDFS directory is /user/root/
Step 3: Create temporary Hive Table and Load data
Now that you have the file in HDFS, you just need to create an external table on top of it. Note that this is just a temporary table.
Use the below Hive script to create an external table hv_csv_table in the schema bdp. Run the script in the Hive CLI:
CREATE SCHEMA IF NOT EXISTS bdp;
CREATE EXTERNAL TABLE IF NOT EXISTS bdp.hv_csv_table
(id STRING, code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip';
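Note that the LOCATION above points to the Hortonworks sandbox NameNode. On a different cluster you can usually drop the scheme, host, and port and use a plain HDFS path, which resolves against your default filesystem (adjust the path to match your own HDFS home directory):
LOCATION '/user/root/bdp/ld_csv_hv/ip'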
Step 4: Verify data
Please check whether the CSV data is showing in the table using the below command.
select * from bdp.hv_csv_table;
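As a quick sanity check, you can also compare the row count with the number of data lines in your CSV file:
SELECT COUNT(*) FROM bdp.hv_csv_table;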
Step 5: Create AVRO table
We have created the temporary table. Now it's time to create a Hive table in Avro format.
Below is the code to create the Avro table hv_avro in Hive.
CREATE TABLE bdp.hv_avro (id STRING,code STRING) STORED AS AVRO;
Note that we have specified STORED AS AVRO in the CREATE TABLE statement.
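STORED AS AVRO lets Hive derive the Avro schema from the column definitions. If you ever need explicit control over the schema (for example, to add doc strings or default values), a sketch using the AvroSerDe with avro.schema.literal looks like this (the table name hv_avro_explicit is illustrative):
CREATE TABLE bdp.hv_avro_explicit
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"hv_avro","namespace":"bdp","fields":[{"name":"id","type":"string"},{"name":"code","type":"string"}]}');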
Step 6: Copy data from the temporary table
As we have already loaded the temporary table hv_csv_table, it's time to load the data from it into the actual Avro table hv_avro.
Use the below code to copy the data:
INSERT INTO TABLE bdp.hv_avro SELECT * FROM bdp.hv_csv_table;
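Note that INSERT INTO appends rows, so re-running it will duplicate the data. To replace whatever is already in the Avro table instead, you can use INSERT OVERWRITE:
INSERT OVERWRITE TABLE bdp.hv_avro SELECT * FROM bdp.hv_csv_table;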
We created the temporary table on top of the external location 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip'.
Now that our target is accomplished, we need to remove the CSV data present at that location. Use the below command to delete the temporary data:
hadoop fs -rm -r -f hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip/*
We remove this data because it has already been copied into the Avro table; keeping the CSV copy as well would just take up extra space in HDFS.
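Optionally, you can also drop the temporary table itself. Because it is an EXTERNAL table, dropping it removes only the metadata and not the files in HDFS, which is why the hadoop fs -rm command above is still needed:
DROP TABLE IF EXISTS bdp.hv_csv_table;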
Step 7: Output
To see the data in the Hive table, go to the Hive prompt and paste the below code:
select * from bdp.hv_avro;
You can get the location of the Avro files using the below command.
DESCRIBE FORMATTED bdp.hv_avro;
Please find the below screenshot for reference.
You can see the files in HDFS using the below command.
hadoop fs -ls hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/bdp.db/hv_avro
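If you have the avro-tools jar available, you can also copy one of these files locally and dump it as JSON to confirm it really is Avro (the file name and jar version below are illustrative):
hadoop fs -get /apps/hive/warehouse/bdp.db/hv_avro/000000_0 .
java -jar avro-tools-1.8.2.jar tojson 000000_0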
That’s All!
Don't forget to subscribe. Keep learning. Keep sharing.