Remove Header of CSV File in hive

Requirement

You have one CSV file which is present at Hdfs location, and you want to create a hive layer on top of this data, but CSV file is having two headers on top of it, and you don’t want them to come into your hive table, so let’s solve this.

Solution

Assume name of CSV file is sample_2.csv which is present at HDFS location ‘bdp/rmhd/ip/sample_2.csv’

Get the sample CSV file from here. sample_2

Let’s see the content of the file using below command

 hadoop fs -cat bdp/rmhd/ip/sample_2.csv

Step 1: Create Table

For better understanding, let’s load the data with headers. So first create an external table (contains headers) in schema bdp with the above-mentioned location as an external location.

Use complete Hdfs location including name node at the beginning. I have Hdfs location as below ‘hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/’

Create schema if you wish to have a new one using below command:

 Create schema IF NOT EXISTS bdp;

Use the below script to create a table:

 CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rm_hd_table
(u_name STRING,
idf BIGINT,
Cn STRING,
Ot STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/';

Use below command to see the output:

 select * from bdp.rm_hd_table;

Step 2: Remove Header

As you can see, in the above step that headers are present, so we have to add table properties to skip two headers.

Below is the script for removing the header.

 CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rmvd_hd_table
(u_name STRING,
idf BIGINT,
Cn STRING,
Ot STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/'
TBLPROPERTIES("skip.header.line.count"="2");

Now, you can see the output which no longer contains header as only 7 records are coming in the output.

Use below command to see the output:

 select * from bdp.rmvd_hd_table;

Wrapping Up

Many times data exported from the database with the header. For the understanding purpose, the header is useful but for processing the data it is required to ignore the header.

Don’t miss the tutorial on Top Big data courses on Udemy you should Buy

Sharing is caring!

Subscribe to our newsletter
Loading

1 Comment

  1. Found it useful . I was trying to remove headers with shell script .It is much easier.

Leave a Reply