Filter records in Pig

Requirement:

The source data holds each user’s Id and mobile connection type. There are four possible connection types: “POSTP, PREP, CLS, PEND”. The requirement is to return the Id of only those users whose connection type is “POSTP, PREP, blank or null”. Where the connection type is blank or null in the data, it has to be reported as “NA”.

WHEN?

While processing data with a Pig script, you will often need to filter out unwanted records: records that carry insufficient information, or records that the business does not expect in the result dataset.

Get the sample file from here: sample_3
INPUT RECORDS DELIMITER: “|”

Follow these steps:

Step 1: Load the file

Load the text file into Pig using the command below. Change the location to match your environment.

INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray,contype:chararray);

Check whether the file is loaded successfully or not.

DUMP INPUT_RECORDS;

Please find below screenshot for reference.

Step 2: Filter the record as per the requirement

Use the command below:

INTERMD_RECORDS = FILTER INPUT_RECORDS BY (contype=='POSTP' OR contype=='PREP' OR contype IS NULL OR contype=='' );

Here, only records whose connection type is “POSTP”, “PREP” or blank are kept. A blank in a string field can mean either null or an empty string, so contype=='' is checked in addition to contype IS NULL.
Let’s view the result using the command below:

DUMP INTERMD_RECORDS;
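
The explicit contype IS NULL test matters because of how Pig handles nulls: comparing a null value with == yields null rather than true or false, and FILTER drops any record whose condition evaluates to null. The following is a minimal sketch of what happens when that test is left out, assuming the same sample file and schema as Step 1; the relation name WITHOUT_NULL_CHECK is only illustrative.

```pig
-- Same file and schema as Step 1; change the location as per your environment.
INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

-- No IS NULL test here. For a record with a null contype, every comparison
-- below evaluates to null, so the whole condition is null and the record is dropped.
WITHOUT_NULL_CHECK = FILTER INPUT_RECORDS BY (contype=='POSTP' OR contype=='PREP' OR contype=='');

DUMP WITHOUT_NULL_CHECK;
```

Comparing this output with INTERMD_RECORDS should show whether the sample contains null connection types that only the IS NULL test keeps.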

Step 3: Output

 OUTPUT_RECORDS = FOREACH INTERMD_RECORDS GENERATE id,((contype IS NULL OR contype=='')?'NA' :contype) AS contype;

Here, the value “NA” is assigned wherever contype is null or an empty string; every other contype is passed through unchanged.

Let’s view the result using the command below:

DUMP OUTPUT_RECORDS;

If you compare this output with the input, you will see that the unwanted records (connection types “CLS” and “PEND”) are no longer present.
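
DUMP only prints the relation to the console. To keep the result, it could be written out with STORE. The sketch below repeats the three steps as a single script; the output directory is a hypothetical path chosen for illustration, and Pig expects that directory not to exist before the script runs.

```pig
-- Step 1: load the pipe-delimited sample file.
INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

-- Step 2: keep POSTP, PREP, null and empty connection types.
INTERMD_RECORDS = FILTER INPUT_RECORDS BY (contype=='POSTP' OR contype=='PREP' OR contype IS NULL OR contype=='');

-- Step 3: replace null or empty connection types with 'NA'.
OUTPUT_RECORDS = FOREACH INTERMD_RECORDS GENERATE id, ((contype IS NULL OR contype=='') ? 'NA' : contype) AS contype;

-- Hypothetical output location; change it as per your environment.
STORE OUTPUT_RECORDS INTO '/root/local_bdp/posts/filter-records-in-pig/output' USING PigStorage('|');
```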

Wrapping up

Apply FILTER as early as possible in a Pig script, so that unwanted records are dropped before later, more expensive steps have to process them.
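
As a rough sketch of filtering early, the script below drops the unwanted connection types before grouping, so “CLS” and “PEND” records never reach the GROUP step. Counting users per connection type is an illustrative task and not part of the original requirement; the relation names WANTED, BY_TYPE and TYPE_COUNTS are assumptions made for this example.

```pig
-- Same file and schema as Step 1; change the location as per your environment.
INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

-- Filter first: only POSTP and PREP records are passed on.
WANTED = FILTER INPUT_RECORDS BY (contype=='POSTP' OR contype=='PREP');

-- Group and count only the records that survived the filter.
BY_TYPE = GROUP WANTED BY contype;
TYPE_COUNTS = FOREACH BY_TYPE GENERATE group AS contype, COUNT(WANTED) AS users;

DUMP TYPE_COUNTS;
```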
