Join in pig

Requirement

You have two tables named as A and B and you want to perform all types of join in Pig. It will help you to understand, how join works in pig.

Solution

Step 1: Input Files

Download file  Aand B from here. And place them into a local directory.

File A and B are the comma delimited file, please refer below:-

I am placing these files into local directory ‘/root/local_bdp/posts/join-in-pig’

Step 2: Enter into Pig Grunt shell.

Enter into grunt shell, type pig –x local.
Now, we are using local mode because we have placed the files in a local directory. Load these two files using the below commands. Replace the location of the file as per your local location.

 A = LOAD '/root/local_bdp/posts/join-in-pig/A' using PigStorage(',') AS (id :int,type:chararray);

B = LOAD '/root/local_bdp/posts/join-in-pig/B' using PigStorage(',') AS (id :int,type:chararray);

Let’s understand join one by one:

A. Inner Join:

Sometimes, it is required to have only common records out of two datasets. Now we have two tables A & B, we are joining based on a key which is id.
So in output, only those records which match id with another dataset will come. Rest will be discarded.

Use below command to perform the inner join in Pig.

 INNER_JOIN = JOIN A BY id, B BY id;

 

Expected output:


Use below command to see the output set:

 DUMP INNER_JOIN

Please refer below screenshot for reference.


As you can see only records which have the same id such as 1, 3, 4 are present in the output, rest has been discarded.

B. Left Join

this type of join is performed when we want to look up for something from other datasets, the best example would be fetching a phone no. of an employee from other datasets based on the employee code.
Use below command to perform the left join.

 LEFT_JOIN = JOIN A BY id LEFT, B BY id;

Expected output

Use below command to see the output set:

 DUMP LEFT_JOIN

Now we have all the records of left table A and matched records of table B.

C. Right Join

This type of join is performed when we want to get all the data of the look-up table with only matching records of the left table.

Use below command to perform right join:

 RIGHT_JOIN = JOIN A BY id RIGHT ,B BY id;

Expected output

Use below command to see the output set:

 DUMP LEFT_JOIN

Now, we have all the records of right table B and matched records of table A.

D.Full Join

When it is needed to get all the matched and unmatched records out of two datasets, we can use full join. All data from left as well as from right datasets will appear in result set. Non-matching records will have null values in the respective columns.
Use below command to perform full join:

 FULL_JOIN = JOIN A BY id FULL, B BY id;

Expected output

Use below command to see the output set:

 DUMP FULL_JOIN

Now, we have all matched and unmatched records in output as shown above.

Wrapping Up

Joins are important when you have to deal with data which are present in multiple tables. In real life, we get files from many sources which have a relation between them, in order to get the meaningful information from these data-sets, it is needed to perform join to get the combined result.

Don’t miss the tutorial on Top Big data courses on Udemy you should Buy

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply