Filter records in Pig

Requirement:

In the source data, you have users' information: an ID and a mobile connection type. There are four possible connection types: "POSTP", "PREP", "CLS", and "PEND". You are required to get the IDs of only those users whose connection type is "POSTP", "PREP", blank, or null. Wherever a blank is present in the data, the connection type must be reported as "NA".

WHEN?

While processing data with a Pig script, you may come across a situation where you need to filter out unwanted records: records that have insufficient information, or records that the business does not expect in the final dataset.

Get the sample file here: sample_3
INPUT RECORDS DELIMITER: “|”

Follow the steps:

Step 1: Load the file

Load the text file data into Pig using the command below. Change the location as per your environment.

 
 
  INPUT_RECORDS = LOAD '/root/local_bdp/posts/filter-records-in-pig/sample_3.txt' USING PigStorage('|') AS (id:chararray, contype:chararray);

Check whether the file was loaded successfully.

 
 
  DUMP INPUT_RECORDS;

Please find below screenshot for reference.
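Under the hood, PigStorage('|') simply splits each line on the pipe character and maps the fields to the declared schema positionally. As a rough sketch of that behaviour, the same parsing can be written in Python (the sample lines below are made up for illustration and are not taken from sample_3.txt):

```python
# Sketch of what PigStorage('|') does: split each line on '|' and map
# the fields to the declared schema (id, contype) by position.
# The sample lines below are hypothetical, not from the real sample file.
sample_lines = [
    "101|POSTP",
    "102|PREP",
    "103|CLS",
    "104|",
]

def load_records(lines):
    records = []
    for line in lines:
        id_, contype = line.split("|", 1)
        records.append({"id": id_, "contype": contype})
    return records

input_records = load_records(sample_lines)
print(input_records[0])  # {'id': '101', 'contype': 'POSTP'}
```

Note that a trailing blank field (as in "104|") parses to an empty string, which is exactly the case the filter in the next step has to handle.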

Step 2: Filter the records as per the requirement.

Use the command below.

 
 
  INTERMD_RECORDS = FILTER INPUT_RECORDS BY (contype == 'POSTP' OR contype == 'PREP' OR contype IS NULL OR contype == '');

Only those records whose connection type is "POSTP", "PREP", or blank will pass this filter. A blank in a string field can mean either null or an empty string, so the condition contype == '' is included alongside contype IS NULL.
Let's visualize it using the command below.

 
 
  DUMP INTERMD_RECORDS;
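As a cross-check of the filter logic, the same predicate can be sketched in Python; the records below are assumed sample values, not output from the real file:

```python
# Equivalent of the Pig FILTER predicate: keep a record when contype is
# 'POSTP', 'PREP', None (null), or '' (empty string).
def keep(contype):
    return contype in ("POSTP", "PREP") or contype is None or contype == ""

# Hypothetical input records for illustration.
records = [
    {"id": "101", "contype": "POSTP"},
    {"id": "102", "contype": "CLS"},
    {"id": "103", "contype": None},
    {"id": "104", "contype": ""},
]
intermd_records = [r for r in records if keep(r["contype"])]
print([r["id"] for r in intermd_records])  # ['101', '103', '104']
```

Records with "CLS" or "PEND" fail every branch of the predicate and are dropped, which matches the requirement.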

Step 3: Output

 
 
  OUTPUT_RECORDS = FOREACH INTERMD_RECORDS GENERATE id, ((contype IS NULL OR contype == '') ? 'NA' : contype) AS contype;

Here we have used Pig's bincond (condition ? value1 : value2) operator to assign the value "NA" wherever contype is blank or null.
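The bincond above maps a blank or null contype to "NA" and leaves other values untouched; the equivalent conditional expression can be sketched in Python (sample records are assumed for illustration):

```python
# Equivalent of Pig's bincond:
# (contype IS NULL OR contype == '') ? 'NA' : contype
def normalize(contype):
    return "NA" if contype is None or contype == "" else contype

# Hypothetical filtered records for illustration.
intermd_records = [
    {"id": "101", "contype": "POSTP"},
    {"id": "103", "contype": None},
    {"id": "104", "contype": ""},
]
output_records = [
    {"id": r["id"], "contype": normalize(r["contype"])}
    for r in intermd_records
]
print([r["contype"] for r in output_records])  # ['POSTP', 'NA', 'NA']
```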

Let's visualize it using the command below.

 
 
  DUMP OUTPUT_RECORDS;

If you compare the above output with the input, you will see that the unwanted records are no longer present in the output.

Wrapping up

Filters should be applied as early as possible when writing a Pig script, as this minimizes the number of unwanted records passed to downstream processing.
Keep learning 🙂
