Requirement
You have one CSV file which is present at Hdfs location, and you want to create a hive layer on top of this data, but CSV file is having two headers on top of it, and you don’t want them to come into your hive table, so let’s solve this.
Solution
Assume name of CSV file is sample_2.csv which is present at HDFS location ‘bdp/rmhd/ip/sample_2.csv’
Get the sample CSV file from here. sample_2
Let’s see the content of the file using below command
hadoop fs -cat bdp/rmhd/ip/sample_2.csv
Step 1: Create Table
For better understanding, let’s load the data with headers. So first create an external table (contains headers) in schema bdp with the above-mentioned location as an external location.
Use complete Hdfs location including name node at the beginning. I have Hdfs location as below ‘hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/’
Create schema if you wish to have a new one using below command:
Create schema IF NOT EXISTS bdp;
Use the below script to create a table:
CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rm_hd_table (u_name STRING, idf BIGINT, Cn STRING, Ot STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/';
Use below command to see the output:
select * from bdp.rm_hd_table;
Step 2: Remove Header
As you can see, in the above step that headers are present, so we have to add table properties to skip two headers.
Below is the script for removing the header.
CREATE EXTERNAL TABLE IF NOT EXISTS bdp.rmvd_hd_table (u_name STRING, idf BIGINT, Cn STRING, Ot STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/rmhd/ip/' TBLPROPERTIES("skip.header.line.count"="2");
Now, you can see the output which no longer contains header as only 7 records are coming in the output.
Use below command to see the output:
select * from bdp.rmvd_hd_table;
Wrapping Up
Many times data exported from the database with the header. For the understanding purpose, the header is useful but for processing the data it is required to ignore the header.
Don’t miss the tutorial on Top Big data courses on Udemy you should Buy
Found it useful . I was trying to remove headers with shell script .It is much easier.