Filter records in pig


Requirement:

The source data contains each user’s Id and mobile connection type. There are four possible connection types: “POSTP”, “PREP”, “CLS” and “PEND”. The requirement is to get the Id of only those users whose connection type is “POSTP”, “PREP”, blank or null. Where the connection type is blank or null, it must be reported as “NA”.

WHEN?

While processing data with a Pig script, you may need to filter out unwanted records, either because they carry insufficient information or because the business does not expect them in the final dataset.

Get the sample file here: sample_3
INPUT RECORDS DELIMITER: “|”

Follow these steps:

Step 1: Load the file

Load the text file into Pig using the command below. Change the path to match your environment.

  INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

Check whether the file loaded successfully.

  DUMP INPUT_RECORDS;

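For a large file, dumping every record can be slow. A lighter check, sketched below, is to print the schema with DESCRIBE and dump only a few rows through LIMIT. The alias PREVIEW and the row count of 5 are illustrative choices, not part of the original steps.

```pig
-- Same load as in Step 1; change the path as per your environment.
INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

-- Print the schema of the relation without dumping its records.
DESCRIBE INPUT_RECORDS;

-- PREVIEW is a hypothetical alias; keep only a handful of rows for a quick look.
PREVIEW = LIMIT INPUT_RECORDS 5;
DUMP PREVIEW;
```

If DESCRIBE lists id and contype as chararray, the schema was applied as intended.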

Step 2: Filter the record as per the requirement.

Use the command below.

  INTERMD_RECORDS = FILTER INPUT_RECORDS BY (contype == 'POSTP' OR contype == 'PREP' OR contype IS NULL OR contype == '');

Only records whose connection type is “POSTP”, “PREP”, null or blank pass this filter. A blank value can arrive either as null or as an empty string, so the condition checks contype == '' as well as contype IS NULL.
Let’s view the result using the command below.

  DUMP INTERMD_RECORDS;
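The same filter can also be written more compactly. The sketch below assumes a Pig version that supports the IN operator (0.12 or later, as far as I know) and additionally assumes that values made up only of spaces should count as blank, which is why TRIM is applied. The alias IN_FILTERED is hypothetical.

```pig
-- Same load as in Step 1; change the path as per your environment.
INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

-- IN replaces the chain of == comparisons.
-- null still needs its own test, because comparing null with anything is never true.
-- TRIM is an added assumption: it makes values such as '  ' or ' PREP ' match as well.
IN_FILTERED = FILTER INPUT_RECORDS BY (contype IS NULL OR TRIM(contype) IN ('POSTP', 'PREP', ''));

DUMP IN_FILTERED;
```

Note that this version is slightly more permissive than the original filter: drop TRIM if padded values should not match.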

Step 3: Output

 
 
  OUTPUT_RECORDS = FOREACH INTERMD_RECORDS GENERATE id, ((contype IS NULL OR contype == '') ? 'NA' : contype) AS contype;

Here “NA” is assigned wherever contype is null or blank.

Let’s view the result using the command below.

  DUMP OUTPUT_RECORDS;

If you compare this output with the input, you will see that the unwanted records are no longer present.
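DUMP only prints to the console. To keep the result, you could write it to a file with STORE, as in the sketch below. The output directory is a hypothetical path chosen for illustration, and the job is expected to fail if that directory already exists.

```pig
-- Steps 1 to 3 repeated so the script runs on its own.
INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);
INTERMD_RECORDS = FILTER INPUT_RECORDS BY (contype == 'POSTP' OR contype == 'PREP' OR contype IS NULL OR contype == '');
OUTPUT_RECORDS = FOREACH INTERMD_RECORDS GENERATE id, ((contype IS NULL OR contype == '') ? 'NA' : contype) AS contype;

-- Hypothetical output location; change it as per your environment.
-- The records are written with the same '|' delimiter as the input.
STORE OUTPUT_RECORDS INTO '/root/local_bdp/posts/filter-records-in-pig/output' USING PigStorage('|');
```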

Wrapping up

Apply the filter as early as possible in a Pig script, so that later steps have fewer unwanted records to process.
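As a rough illustration of filtering early, the sketch below counts users per connection type and drops the unwanted types before the GROUP, so the grouping step only sees the records it needs. The counting requirement and the aliases WANTED, BY_TYPE and COUNTS are made up for this example.

```pig
-- Same load as in Step 1; change the path as per your environment.
INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

-- Filter first: CLS, PEND and blank records never reach the GROUP.
WANTED = FILTER INPUT_RECORDS BY (contype == 'POSTP' OR contype == 'PREP');

-- Group and count only the remaining records.
BY_TYPE = GROUP WANTED BY contype;
COUNTS = FOREACH BY_TYPE GENERATE group AS contype, COUNT(WANTED) AS users;

DUMP COUNTS;
```

Placing the FILTER after the GROUP would give the same counts for these two types, but the grouping would have to handle every record in the file.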
Keep learning 🙂
