Load timestamp values from file into pig

 Requirement

Assume that you want to load a file having timestamp values (yyyy-MM-dd HH:mm:ss) into pig. After loading into pig add one day into each timestamp value.

Solution

Please follow the below steps:-

Step 1: Sample file

Create a sample file named as timestamp_sample.txt file and put timestamp values in format ‘yyyy-MM-dd HH:mm:ss’ or you can get the file from here timestamp_sample

I have a local directory as /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/, so I have placed a  timestamp_sample.txt file in that directory.

Place this file into Hdfs location, I am creating ts_pig directory in Hdfs location bdp/posts/

 
 
  1. hadoop fs -mkdir bdp/posts/ts_pig
  2. hadoop fs -put /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/timestamp_sample.txt bdp/posts/ts_pig

After executing make sure that file has transferred successfully.

Step 2: Load Timestamp values

Now I will load the file in pig, Enter a below-mentioned command in putty.

 
 
  1.  pig

It will take you to the grunt shell. Type the below-mentioned command in grunt shell.

 
 
  1. Input_File = Load 'bdp/posts/ts_pig/timestamp_sample.txt' using PigStorage(',') As (dt:chararray);

You can see whether the file is loaded or not using below command:-

 
 
  1. dump Input_File

It will show the content of relation Input_File.

The Reason why I have used chararray as datatype of dt column is ,because if you use datetime as datatype then it will result null values in the output.so best way is to get all values in chararray and later convert them in to datetime using below command.

 
 
  1. Timestamp_data = Foreach Input_File Generate ToDate(dt,'yyyy-MM-dd HH:mm:ss') AS mydt;

ToDate() is built-in function which takes chararray as first argument and second argument is the type of format you are passing into chararray. As we are dealing with timestamp so we are using ‘yyyy-MM-dd HH:mm:ss’  format.

To see the output of relation Timestamp_data use below command.

 
 
  1. Dump Timestamp_data

In above output you can see that it is showing values in datetime format, which means that now you can perform any kind of date operation on this dataset.

Suppose you want to add one day in each value then refer below step.

Step 3: Add one day in datetime

As we have datetime datatype, we can use built-in function AddDuration().

So second argument would be P0Y0M1DT0H0m0s, we have written 1D and all other values are 0,so in the output you will get +1 day in all values.

Use below command to achieve the same.

 
 
  1. Add_Timestamp_data = Foreach Timestamp_data Generate mydt,AddDuration(mydt,'P0Y0M1DT0H0m0s') As newmydt;

Now you can see in the output that there is difference of 1 day between mydt (old) and newmydt(new).

Wrapping Up

Many times we receive files from relational database, which have timestamp values, but in order to transform them using pig the datatype must be datetime,which can be achieve using above approach .

keep learning .subscribe us for more tutorials.

 

Load CSV file into hive AVRO table

Requirement You have comma separated(CSV) file and you want to create Avro table in hive on top of it, then ...
Read More

Load CSV file into hive PARQUET table

Requirement You have comma separated(CSV) file and you want to create Parquet table in hive on top of it, then ...
Read More

Hive Most Asked Interview Questions With Answers – Part II

What is bucketing and what is the use of it? Answer: Bucket is an optimisation technique which is used to ...
Read More
/ hive, hive interview, interview-qa

Spark Interview Questions Part-1

Suppose you have a spark dataframe which contains millions of records. You need to perform multiple actions on it. How ...
Read More

Leave a Reply