Load timestamp values from file into pig

 Requirement

Assume that you want to load a file having timestamp values (yyyy-MM-dd HH:mm:ss) into Pig. After loading into Pig, add one day into each timestamp value.

Solution

Follow the below steps:-

Step 1: Sample file

Create a sample file named as timestamp_sample.txt file and put timestamp values in format ‘yyyy-MM-dd HH:mm:ss’ or you can get the file from here timestamp_sample

I have a local directory as /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/, so I have placed a  timestamp_sample.txt file in that directory.

Copy this file into Hdfs location, I am creating ts_pig directory in Hdfs location bdp/posts/

 hadoop fs -mkdir bdp/posts/ts_pig

hadoop fs -put /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/timestamp_sample.txt bdp/posts/ts_pig

After executing make sure that file has transferred successfully.

Step 2: Load Timestamp values

Now, I will load the file in pig, Enter a below-mentioned command in terminal:

  pig

It will take you to the grunt shell. Use the below-mentioned command in grunt shell:

 Input_File = Load 'bdp/posts/ts_pig/timestamp_sample.txt' using PigStorage(',') As (dt:chararray);

You can see whether the file is loaded or not using below command:-

 dump Input_File

It will show the content of relation Input_File.

The reason why I have used chararray as the datatype of dt column is, because if you use datetime as datatype, then it will result null values in the output. So, the best way is to get all values in chararray and later convert them into datetime using below command:

 Timestamp_data = Foreach Input_File Generate ToDate(dt,'yyyy-MM-dd HH:mm:ss') AS mydt;

ToDate() is a built-in function which takes chararray as a first argument and the second argument is the type of format you are passing into chararray. As we are dealing with timestamp so we are using ‘yyyy-MM-dd HH:mm:ss’  format.

To see the output of relation Timestamp_data use below command:

 Dump Timestamp_data

In the above output, you can see that it is showing values in datetime format, which means that now you can perform any kind of date operation on this dataset.

Suppose you want to add one day in each value then refer below step.

Step 3: Add one day in datetime

As we have datetime datatype, we can use built-in function AddDuration().

So, the second argument would be P0Y0M1DT0H0m0s, we have written 1D and all other values are 0, so in the output, you will get +1 day in all values.

Use below command to achieve the same:

 Add_Timestamp_data = Foreach Timestamp_data Generate mydt,AddDuration(mydt,'P0Y0M1DT0H0m0s') As newmydt;

Now you can see in the output that there is a difference of 1 day between mydt (old) and newmydt(new).

Wrapping Up

Many times we receive files from the relational database, which have timestamp values, but in order to transform them using Pig the datatype must be datetime, which can be achieved using the above approach.

keep learning.

Don’t miss the tutorial on Top Big data courses on Udemy you should Buy

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply