Remove the Header of a CSV File in Hive

Requirement

You have a CSV file at an HDFS location and you want to create a Hive table on top of this data. The file starts with two header lines, and you don’t want them to show up as rows in your Hive table. Let’s solve this.

Solution

Assume the CSV file is named sample_2.csv and is stored at the HDFS location ‘bdp/rmhd/ip/sample_2.csv’.

You can download the sample CSV file here: sample_2

Let’s look at the content of the file using the command below:

    hadoop fs -cat bdp/rmhd/ip/sample_2.csv

Step 1: Create Table

For better understanding, let’s first load the data with the headers included. Create an external table in the schema bdp that uses the location mentioned above as its external location. The rows of this table will still contain the headers.

Use the complete HDFS location, including the name node at the beginning. In my case the HDFS location is ‘hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/’.

If you want a new schema, create it using the command below:

    CREATE SCHEMA IF NOT EXISTS bdp;

Use the script below to create the table. The fields in the file are separated by ‘|’, so the table is declared with FIELDS TERMINATED BY '|'.

    CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rm_hd_table
    (u_name STRING,
    idf BIGINT,
    Cn STRING,
    Ot STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/';

Use the command below to see the output:

    SELECT * FROM bdp.rm_hd_table;

Step 2: Remove Header

As you can see in the step above, the headers are present in the table. To leave them out, we add a table property that tells Hive to skip the first two lines of the file: skip.header.line.count = 2.

Below is the script that creates a table without the headers. It is the same as the script in Step 1 except for the table name and the added TBLPROPERTIES clause.

    CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rmvd_hd_table
    (u_name STRING,
    idf BIGINT,
    Cn STRING,
    Ot STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/'
    TBLPROPERTIES("skip.header.line.count"="2");

Use the command below to see the output:

    SELECT * FROM bdp.rmvd_hd_table;

The output no longer contains the headers. Only the 7 data records are returned.
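
As a quick cross-check outside Hive, the expected number of data rows can also be counted from the shell: it is the total number of lines in the file minus the skipped header lines. The sketch below is only an illustration. count_data_rows is a hypothetical helper name (it is not part of Hadoop or Hive), and it assumes a POSIX shell with tail and awk available and one record per line in the file.

```shell
# Hypothetical helper: print the number of lines left after skipping
# the first N header lines of a text file.
# Usage: count_data_rows <header_lines> [file]
# Reads standard input when no file is given.
count_data_rows() {
  skip=$1
  shift
  # "tail -n +K" starts printing at line K, so K = skip + 1 drops `skip` lines.
  # awk prints the number of lines it received (0 for empty input).
  tail -n "+$((skip + 1))" "$@" | awk 'END { print NR }'
}

# With the HDFS file from this post (needs a Hadoop client on the PATH):
# hadoop fs -cat bdp/rmhd/ip/sample_2.csv | count_data_rows 2
```

If the file has exactly two header lines, the number printed for sample_2.csv should match the number of records returned by the query on bdp.rmvd_hd_table.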

Wrapping Up

Data exported from a database often comes with a header. The header helps a person understand the data, but it has to be ignored when the data is processed. The skip.header.line.count table property lets Hive ignore it without changing the file.
