Remove Header of CSV File in Hive

Requirement

You have a CSV file at an HDFS location and want to create a Hive table on top of this data. The file starts with two header lines, and you don't want them to show up as rows in your Hive table. Let's solve this.

Solution

Assume the CSV file is named sample_2.csv and is present at the HDFS location 'bdp/rmhd/ip/sample_2.csv'.

Get the sample CSV file from here. sample_2

Let's look at the content of the file using the command below.

    hadoop fs -cat bdp/rmhd/ip/sample_2.csv

Step 1: Create Table

For better understanding, let's first load the data with the headers included. Create an external table in the schema bdp that uses the location mentioned above as its external location. This table will still contain the headers.

Use the complete HDFS location, including the name node, at the beginning. In my case the HDFS location is 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/'.

If you want a new schema, create it using the command below.

    CREATE SCHEMA IF NOT EXISTS bdp;

Use the script below to create the table.

    -- External table over the raw file; the header lines are still read as rows
    CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rm_hd_table
    (u_name STRING,
    idf BIGINT,
    Cn STRING,
    Ot STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/';

Use the command below to see the output.

    SELECT * FROM bdp.rm_hd_table;

Step 2: Remove Header

As you can see in the step above, the headers are present in the table. To skip them, we add the table property skip.header.line.count with a value of 2.

Below is the script for removing the header.

    -- Same definition as before, plus a table property that skips the two header lines
    CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rmvd_hd_table
    (u_name STRING,
    idf BIGINT,
    Cn STRING,
    Ot STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/'
    TBLPROPERTIES("skip.header.line.count"="2");

The output of this table no longer contains the headers; only the 7 data records are returned.

Use the command below to see the output.

    SELECT * FROM bdp.rmvd_hd_table;
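As a rough local illustration of what the skip.header.line.count property does, the sketch below drops the first two lines of a small pipe-delimited file using tail. The file contents and the helper name skip_header_lines are made up for this example and are not the contents of sample_2.csv. Hive does the skipping itself when it reads the table, so this is only a way to picture the effect, not something you need to run against the data.

```shell
# Local illustration only: mimic the effect of skip.header.line.count=2 on one file.
# The helper name and the sample rows below are hypothetical.
skip_header_lines() {
  # $1 = number of leading lines to drop; reads stdin, writes stdout
  tail -n "+$(( $1 + 1 ))"
}

# Build a small pipe-delimited file with two header lines and three data rows
sample_file="$(mktemp)"
printf '%s\n' \
  'user report' \
  'u_name|idf|Cn|Ot' \
  'alice|1|IN|web' \
  'bob|2|US|app' \
  'carol|3|UK|web' > "$sample_file"

# Print only the data rows, the way the Hive table would expose them
skip_header_lines 2 < "$sample_file"
```

The advantage of the table property over a script like this is that the file in HDFS stays untouched, so nothing has to be rewritten before Hive can read it.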

Wrapping Up

Data exported from a database often comes with a header. The header helps when reading the data, but it has to be ignored when processing the data.

