Calculate percentage using Pig

Requirement

You have the marks of all the students of a class, identified by roll number, in a CSV file. The task is to calculate each student's percentage of marks using Pig.

Given:

Download the sample CSV file marks.csv, which has 7 columns: the 1st column is the roll number and the other 6 columns are subject1, subject2, …, subject6.

Solution

Step 1: Loading the sample CSV file into HDFS

I have a local directory named “calculate-percentage-using-pig” under the path “/root/local_bdp/problems/”,
and I have placed the marks.csv file in that directory.

You can see the sample data in the screenshot below:

Let’s create the HDFS directory using the command below:

 hadoop fs -mkdir -p hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_pig/ip/

As you can see, the “ip” directory is created for the input files.
Now we can copy the file into HDFS using the command below:

 hadoop fs -put /root/local_bdp/problems/calculate-percentage-using-pig/marks.csv hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_pig/ip/

Step 2: Grunt Shell

Now it’s time to interact with the Grunt shell.
Enter the command below:

  pig

It will take you to the Grunt shell.
Create a relation named ip_marks that holds all the data from marks.csv. Since the file is in HDFS, give that HDFS location as the path and declare the 7 columns with their data types, as in the command below:

 ip_marks = LOAD 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_pig/ip/marks.csv' USING PigStorage(',') AS (roll_no:int,subject1:int,subject2:int,subject3:int,subject4:int,subject5:int,subject6:int);

To see the data of this relation, use the command below:

 dump ip_marks;

Refer to the screenshot below.

Step 3: Calculation of Percentage

Use the command below to calculate the percentage:

 per_mrks = FOREACH ip_marks GENERATE roll_no, (float)(subject1+subject2+subject3+subject4+subject5+subject6)/6.0 AS percentage;

In the above command, we add the marks of all six subjects, cast the sum to float, and then divide it by 6. This assumes that each subject is marked out of 100, so the total is out of 600 and dividing the sum by 6 gives the percentage directly. You can change the formula if your subjects use different maximum marks.
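If you want to cross-check the formula outside Pig, the same arithmetic can be sketched with awk on a local copy of the file. This is only a rough check: the helper name calc_percentage, the /tmp path, and the sample row are made up for illustration, and it assumes the CSV has no header row and that each subject is out of 100. The two-decimal formatting is a choice made here, so the output will not match Pig's formatting exactly.

```shell
# Hypothetical helper: compute roll_no,percentage for each row of a local CSV,
# using the same formula as the Pig statement (sum of six subjects / 6.0).
calc_percentage() {
  awk -F',' '{ printf "%d,%.2f\n", $1, ($2 + $3 + $4 + $5 + $6 + $7) / 6.0 }' "$1"
}

# Made-up sample row for illustration only (not taken from marks.csv).
printf '1,80,75,90,85,70,95\n' > /tmp/sample_marks.csv

# The six marks add up to 495, and 495 / 6 = 82.5.
calc_percentage /tmp/sample_marks.csv   # prints 1,82.50
```

Running the helper against your local marks.csv and comparing a few roll numbers with the Pig output is a quick way to confirm the calculation.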

Step 4: Output

Use the command below to see the data of per_mrks:

 dump per_mrks;

To save the data into a file, use the command below:

 STORE per_mrks INTO 'hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_pig/op/' using PigStorage(',');

It will create an “op” directory. The output directory must not already exist; otherwise, the command throws an exception. You can list the output files using the command below:

 hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_pig/op/

Please refer to the screenshot below.
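Because STORE fails when the output directory already exists, re-running the job usually means removing the old “op” directory first. The sketch below shows one way to do that from the regular shell (not from Grunt). The function name remove_hdfs_dir_if_exists is hypothetical; it relies only on the standard hadoop fs -test -d and hadoop fs -rm -r commands and assumes the hadoop client is on your PATH.

```shell
# Hypothetical helper: delete an HDFS directory only if it already exists,
# so that a repeated STORE does not fail on an existing output path.
remove_hdfs_dir_if_exists() {
  dir="$1"
  # `hadoop fs -test -d` exits with status 0 when the path is an existing directory.
  if hadoop fs -test -d "$dir"; then
    # Recursively remove the directory and its part files.
    hadoop fs -rm -r "$dir"
  fi
}
```

For this tutorial you would call it with the output path before running STORE again:

 remove_hdfs_dir_if_exists hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_pig/op/

Note that this permanently deletes the earlier output, so copy it elsewhere first if you still need it.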

Keep learning. Keep sharing.
