Requirement
If you have a comma-separated (CSV) file and you want to create a Parquet table in Hive on top of it, follow the steps below.
Solution
Step 1: Sample CSV File
Create a sample CSV file named sample_1.csv.
(You can skip this step if you already have a CSV file; just place it in a local directory.)
Put content in that file, delimited by a comma (,). If you created the file on Windows, transfer it to your Linux machine, for example via WinSCP.
I have a local directory named input_files, and I have placed the sample_1.csv file in that directory. You can see the content of that file using the below command:
cat /root/bigdataprogrammers/input_files/sample_1.csv
It will show the content of the file.
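For illustration, assume the file holds two comma-separated columns that match the table schema used in the later steps (id and code); the values below are made-up sample rows:
1,BDP01
2,BDP02
3,BDP03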
Step 2: Copy CSV to HDFS
Run the below commands in the shell for the initial setup.
First, create an HDFS directory named ld_csv_hv, and a directory named ip inside it, using the below commands:
hadoop fs -mkdir bdp/ld_csv_hv
hadoop fs -mkdir bdp/ld_csv_hv/ip
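If your Hadoop version supports the -p flag (most recent versions do), you can create the parent and child directories in a single command instead:
hadoop fs -mkdir -p bdp/ld_csv_hv/ip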
Put the file in HDFS using the below command:
hadoop fs -put /root/bigdataprogrammers/input_files/sample_1.csv bdp/ld_csv_hv/ip/
Check whether the file is available in HDFS or not using the below command:
hadoop fs -ls bdp/ld_csv_hv/ip/
NOTE: For me, the default HDFS directory is /user/root/.
Step 3: Create temporary Hive Table and Load data
Now that you have the file in HDFS, you just need to create an external table on top of it. Note that this is just a temporary table.
Use the below Hive script to create an external table hv_csv_table in the schema bdp. Run the below script in the Hive CLI.
CREATE SCHEMA IF NOT EXISTS bdp;
CREATE EXTERNAL TABLE IF NOT EXISTS bdp.hv_csv_table
(id STRING, Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip';
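As an aside, if you would rather not hard-code the NameNode host and port, Hive also accepts a plain HDFS path for LOCATION, which resolves against the cluster's default filesystem. A minimal sketch, assuming the default /user/root home directory mentioned above:
CREATE EXTERNAL TABLE IF NOT EXISTS bdp.hv_csv_table
(id STRING, Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/root/bdp/ld_csv_hv/ip';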
Step 4: Verify data
Please check whether the CSV data is showing in the table using the below command:
select * from bdp.hv_csv_table;
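If the external table is wired up correctly, the query should return the raw CSV rows. With the hypothetical sample content shown in Step 1, the output would look something like:
1    BDP01
2    BDP02
3    BDP03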
Step 5: Create Parquet table
We have created the temporary table. Now it's time to create a Hive table stored in the Parquet format.
Below is the script that creates the Parquet table hv_parq in Hive.
CREATE TABLE bdp.hv_parq (id STRING,code STRING) STORED AS PARQUET;
Note that we have specified STORED AS PARQUET in the CREATE TABLE statement.
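Parquet files are typically compressed. If you want to control the codec, Hive exposes a parquet.compression setting that can be set in the session before running the insert in Step 6; Snappy here is just an assumption, and the exact property can vary by Hive version:
SET parquet.compression=SNAPPY;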
Step 6: Copy data from the temporary table
As we have already loaded the temporary table hv_csv_table, it's time to load the data from it into the actual Parquet table hv_parq.
Use the below code to copy the data:
INSERT INTO TABLE bdp.hv_parq SELECT * FROM bdp.hv_csv_table;
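As a side note, Steps 5 and 6 can be combined into a single CREATE TABLE AS SELECT (CTAS) statement, which creates the Parquet table and copies the data in one go; a minimal sketch using the same table names:
CREATE TABLE bdp.hv_parq STORED AS PARQUET AS SELECT * FROM bdp.hv_csv_table;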
We created the temporary table on top of the external location hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip, and now that our target is accomplished, we can remove the CSV data that is present at that location. Use the below command to delete the temporary data:
hadoop fs -rm -r -f hdfs://sandbox.hortonworks.com:8020/user/root/bdp/ld_csv_hv/ip/*
We are removing this data because we do not want the raw CSV data to take up extra space in HDFS; that is the very reason we created the Parquet table. The data now lives only in the Parquet table, in a smaller, compressed form, which helps reduce storage cost.
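Since the external table now points at an empty directory, you can also drop the temporary table itself; dropping an external table removes only its metadata, not the underlying files (which we have already deleted):
DROP TABLE IF EXISTS bdp.hv_csv_table;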
Step 7: Output
To see the data in the Hive table, go to the Hive prompt and run the below query:
select * from bdp.hv_parq;
You can get the location of the Parquet files using the below command:
DESCRIBE FORMATTED bdp.hv_parq;
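In the output, look for the Location field; on this sandbox it should point at the table's warehouse directory, something like:
Location: hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/bdp.db/hv_parq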
You can see the files in HDFS using the below command:
hadoop fs -ls hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/bdp.db/hv_parq;
Wrapping Up
Parquet supports very efficient compression and encoding schemes. Because it stores data column by column, a query that reads only a few columns instead of the complete set can skip the rest of the data, which can reduce query time significantly.
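For example, a query that selects a single column only has to read that column's data from the Parquet files:
SELECT id FROM bdp.hv_parq;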