Load timestamp values from file into pig

Load timestamp values from file into pig

 Requirement

Assume that you want to load a file having timestamp values (yyyy-MM-dd HH:mm:ss) into pig. After loading into pig add one day into each timestamp value.

Solution

Please follow the below steps:-

Step 1: Sample file

Create a sample file named as timestamp_sample.txt file and put timestamp values in format ‘yyyy-MM-dd HH:mm:ss’ or you can get the file from here timestamp_sample

I have a local directory as /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/, so I have placed a  timestamp_sample.txt file in that directory.

Place this file into Hdfs location, I am creating ts_pig directory in Hdfs location bdp/posts/

 
 
hadoop fs -mkdir bdp/posts/ts_pig
hadoop fs -put /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/timestamp_sample.txt bdp/posts/ts_pig

After executing make sure that file has transferred successfully.

Step 2: Load Timestamp values

Now I will load the file in pig, Enter a below-mentioned command in putty.

 
 
 pig

It will take you to the grunt shell. Type the below-mentioned command in grunt shell.

 
 
  1. Input_File = Load 'bdp/posts/ts_pig/timestamp_sample.txt' using PigStorage(',') As (dt:chararray);

You can see whether the file is loaded or not using below command:-

 
 
  1. dump Input_File

It will show the content of relation Input_File.

The Reason why I have used chararray as datatype of dt column is ,because if you use datetime as datatype then it will result null values in the output.so best way is to get all values in chararray and later convert them in to datetime using below command.

 
 
  1. Timestamp_data = Foreach Input_File Generate ToDate(dt,'yyyy-MM-dd HH:mm:ss') AS mydt;

ToDate() is built-in function which takes chararray as first argument and second argument is the type of format you are passing into chararray. As we are dealing with timestamp so we are using ‘yyyy-MM-dd HH:mm:ss’  format.

To see the output of relation Timestamp_data use below command.

 
 
  1. Dump Timestamp_data

In above output you can see that it is showing values in datetime format, which means that now you can perform any kind of date operation on this dataset.

Suppose you want to add one day in each value then refer below step.

Step 3: Add one day in datetime

As we have datetime datatype, we can use built-in function AddDuration().

So second argument would be P0Y0M1DT0H0m0s, we have written 1D and all other values are 0,so in the output you will get +1 day in all values.

Use below command to achieve the same.

 
 
  1. Add_Timestamp_data = Foreach Timestamp_data Generate mydt,AddDuration(mydt,'P0Y0M1DT0H0m0s') As newmydt;

Now you can see in the output that there is difference of 1 day between mydt (old) and newmydt(new).

Wrapping Up

Many times we receive files from relational database, which have timestamp values, but in order to transform them using pig the datatype must be datetime,which can be achieve using above approach .

keep learning .subscribe us for more tutorials.

 

17
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.