Load a CSV file into a Hive ORC table

Requirement

If you have a comma-separated file and you want to create an ORC-formatted table in Hive on top of it, follow the steps mentioned below.

Solution

Step 1: Sample CSV File

Create a sample CSV file named sample_1.csv.

Download it from here: sample_1

(You can skip this step if you already have a CSV file; just place it in your local directory.)

If you already have some sample data, put it in the file using a comma (,) as the delimiter. If you created the file on Windows, transfer it to your Linux machine, for example via WinSCP.
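If you do not have the download handy, you can also create the file straight from the Linux shell. The rows below are made-up values used only for illustration; they match the two columns (id and code) that we will map in Hive later:

 # made-up sample rows (id,code) for illustration only
 echo "1,CODE_A" > /root/bigdataprogrammers/input_files/sample_1.csv
 echo "2,CODE_B" >> /root/bigdataprogrammers/input_files/sample_1.csv
 echo "3,CODE_C" >> /root/bigdataprogrammers/input_files/sample_1.csv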


I have a local directory named input_files, and I have placed sample_1.csv in that directory. You can see the content of that file using the below command:

 cat /root/bigdataprogrammers/input_files/sample_1.csv

It will show the file content.

Step 2: Copy CSV to HDFS

Run the below commands in the shell for the initial setup.

First, create an HDFS directory named ld_csv_hv with a subdirectory ip using the below commands:

 hadoop fs -mkdir bdp/ld_csv_hv
 hadoop fs -mkdir bdp/ld_csv_hv/ip
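Depending on your setup, the parent directory bdp may not exist yet. If your Hadoop version supports it, you can create the whole path in one go with the -p flag:

 hadoop fs -mkdir -p bdp/ld_csv_hv/ip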

Put the file into HDFS using the below command:

 hadoop fs -put /root/bigdataprogrammers/input_files/sample_1.csv bdp/ld_csv_hv/ip/

Check whether the file is available in HDFS using the below command:

 hadoop fs -ls bdp/ld_csv_hv/ip/

NOTE: For me, the default HDFS directory is /user/root/, so the relative path bdp/ld_csv_hv/ip resolves to /user/root/bdp/ld_csv_hv/ip.
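If you are not sure what your own default HDFS directory is, running the ls command without a path lists your HDFS home directory:

 hadoop fs -ls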

Step 3: Create a temporary Hive table and load data

Now that you have the file in HDFS, you just need to create an external table on top of it. Note that this is just a temporary table.

Use the below Hive script to create an external table named hv_csv_table in the schema bdp. Run the script in the Hive CLI.

 CREATE SCHEMA IF NOT EXISTS bdp;
CREATE EXTERNAL TABLE IF NOT EXISTS bdp.hv_csv_table
(id STRING,Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip';
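One thing to watch out for: the script above assumes the CSV has no header row. If your file does have a header, Hive would otherwise read it as a data row; in that case you can optionally add a skip.header.line.count table property to the same CREATE statement, for example:

CREATE EXTERNAL TABLE IF NOT EXISTS bdp.hv_csv_table
(id STRING, Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip'
TBLPROPERTIES ("skip.header.line.count"="1");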

Step 4: Verify data

Check whether the CSV data is showing in the table using the below query:

 select * from bdp.hv_csv_table;
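If the file is large and a full scan is inconvenient, a row count is a quick sanity check that the data was picked up:

 select count(*) from bdp.hv_csv_table;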

Step 5: Create an ORC table

We have created a temporary table. Now it’s time to create a Hive table in ORC format. The main advantage of the ORC format is that it reduces the storage size of a table.
Below is the query to create an ORC table named hv_orc in Hive.

CREATE TABLE bdp.hv_orc 
(
id STRING,
code STRING
) 
STORED AS ORC;

Note that we have specified ORC storage (STORED AS ORC) in the create query.
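By default Hive compresses ORC data with ZLIB. If you want to pick the codec yourself, for example SNAPPY, you can optionally set the orc.compress table property; this is just a variant of the same table, not something the original steps require:

CREATE TABLE bdp.hv_orc
(
id STRING,
code STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");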

Step 6: Copy data from the temporary table

As we have already loaded the temporary table hv_csv_table, it’s time to load the data from it into the actual ORC table hv_orc.
Use the below query to copy the data:

INSERT INTO TABLE bdp.hv_orc SELECT * FROM bdp.hv_csv_table;
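Keep in mind that INSERT INTO appends rows, so running the statement twice would duplicate the data. If you need to rerun the load and replace whatever is already in hv_orc, use INSERT OVERWRITE instead:

INSERT OVERWRITE TABLE bdp.hv_orc SELECT * FROM bdp.hv_csv_table;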

Remember that we created the temporary table on top of the external location ‘hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip’.

Now that our target is accomplished, we need to remove the CSV data that is present at that location. Use the below command to delete the temporary data:

 hadoop fs -rm -r -f hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip/*

We remove this data because we do not want the raw CSV to keep occupying space in HDFS; that is exactly why we created the ORC table. The data now lives only in the ORC table, stored in a smaller, compressed form in HDFS, which helps reduce storage cost.
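If you want to see the difference yourself, compare the size of the CSV location (before deleting it) with the ORC table’s directory using hadoop fs -du -h. The warehouse path below is only an assumption; it depends on your hive.metastore.warehouse.dir setting (on the Hortonworks sandbox it is typically /apps/hive/warehouse):

 hadoop fs -du -h bdp/ld_csv_hv/ip
 hadoop fs -du -h /apps/hive/warehouse/bdp.db/hv_orc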

Step 7: Output

To see the data in the Hive table, go to the Hive CLI and run the below query:

 select * from bdp.hv_orc;

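If you also want to confirm that the table is really stored as ORC, you can check its definition from the Hive CLI; the output should include the ORC input/output format classes:

 show create table bdp.hv_orc;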

Wrapping Up

ORC, i.e. the Optimized Row Columnar format, is one of the most commonly used file formats when it comes to minimizing data storage cost. It stores data in a compressed, columnar layout.

Don’t forget to subscribe. Keep learning.

Don’t miss the tutorial on Top Big data courses on Udemy you should Buy

Read Interview QA on Big Data

 


1 Comment

  1. Reddy Reddy

    Hi…. I am planning to create a table in cassandra depending on the csv file in hdfs

    I have chosen Hive to do it

    1) I have uploaded a csv file(dataset.csv) into hdfs using the same commands

    hdfs dfs -ls /h1bdata/dataset/ gives me the path of a file

    2) Created a table in hive depending on the csv file and gave the same location

    but my table is giving empty output… please let me know if you need any other details.. thank u
