Assume that you want to load a file having timestamp values (yyyy-MM-dd HH:mm:ss) into pig. After loading into pig add one day into each timestamp value.
Please follow the below steps:-
Step 1: Sample file
Create a sample file named as timestamp_sample.txt file and put timestamp values in format ‘yyyy-MM-dd HH:mm:ss’ or you can get the file from here timestamp_sample
I have a local directory as /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/, so I have placed a timestamp_sample.txt file in that directory.
Place this file into Hdfs location, I am creating ts_pig directory in Hdfs location bdp/posts/
- hadoop fs -mkdir bdp/posts/ts_pig
- hadoop fs -put /root/local_bdp/posts/load-timestamp-values-from-file-into-pig/timestamp_sample.txt bdp/posts/ts_pig
After executing make sure that file has transferred successfully.
Step 2: Load Timestamp values
Now I will load the file in pig, Enter a below-mentioned command in putty.
It will take you to the grunt shell. Type the below-mentioned command in grunt shell.
- Input_File = Load 'bdp/posts/ts_pig/timestamp_sample.txt' using PigStorage(',') As (dt:chararray);
You can see whether the file is loaded or not using below command:-
- dump Input_File
It will show the content of relation Input_File.
The Reason why I have used chararray as datatype of dt column is ,because if you use datetime as datatype then it will result null values in the output.so best way is to get all values in chararray and later convert them in to datetime using below command.
- Timestamp_data = Foreach Input_File Generate ToDate(dt,'yyyy-MM-dd HH:mm:ss') AS mydt;
ToDate() is built-in function which takes chararray as first argument and second argument is the type of format you are passing into chararray. As we are dealing with timestamp so we are using ‘yyyy-MM-dd HH:mm:ss’ format.
To see the output of relation Timestamp_data use below command.
- Dump Timestamp_data
In above output you can see that it is showing values in datetime format, which means that now you can perform any kind of date operation on this dataset.
Suppose you want to add one day in each value then refer below step.
Step 3: Add one day in datetime
As we have datetime datatype, we can use built-in function AddDuration().
So second argument would be P0Y0M1DT0H0m0s, we have written 1D and all other values are 0,so in the output you will get +1 day in all values.
Use below command to achieve the same.
- Add_Timestamp_data = Foreach Timestamp_data Generate mydt,AddDuration(mydt,'P0Y0M1DT0H0m0s') As newmydt;
Now you can see in the output that there is difference of 1 day between mydt (old) and newmydt(new).
Many times we receive files from relational database, which have timestamp values, but in order to transform them using pig the datatype must be datetime,which can be achieve using above approach .
keep learning .subscribe us for more tutorials.