Project: Second Toppers of School

 

In this project we calculate the second topper of each class in a school. It will help you understand how to find the second maximum value in Pig scripts.

Goal:

The main aim of this project is to understand how to find the second maximum value in Apache Pig.

Requirement:

Find the second topper of each class and store the result in a Hive table.

Given:

1. Data for each class is given in CSV files, which are present at a local location.
2. Accomplish the task using a bash shell script, a Pig script, and a Hive script.

You can see the sample input files below. Get the sample files from the following links:

class1 class7

Blueprint

Let’s start developing this small project. It would be great if we first drew a blueprint of the project.

Summary of the blueprint:

1. As mentioned, we have class-wise input data at a local location. This will act as the input to our application.

2. We have to transfer this data to an HDFS location, from where it will be available to Pig.

3. Pig will take the data from the HDFS input location, calculate the second maximum for each class, and store the output at another HDFS location.

4. We will create a Hive layer on top of this output data so that it is available for analysis.
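Step 3 is the heart of the pipeline. Before building it in Pig, the "second topper per class" computation can be sketched locally with plain shell tools; the sample rows below are hypothetical and only illustrate the idea:

```shell
#!/bin/bash
# Hypothetical sample in the same CSV layout as the class files:
# name,percentage,class
cat > /tmp/classes.csv <<'EOF'
amit,88.5,1
ravi,91.0,1
sita,79.0,1
john,95.0,2
mary,93.5,2
alex,90.0,2
EOF

# For each class: keep its rows, sort by percentage (field 2) descending,
# and print the second row, i.e. the second topper of that class.
for c in 1 2; do
  awk -F',' -v c="$c" '$3 == c' /tmp/classes.csv \
    | sort -t',' -k2,2nr \
    | sed -n '2p'
done
# prints:
# amit,88.5,1
# mary,93.5,2
```

This is only a sanity check of the logic on one machine; the rest of the project does the same thing at scale on HDFS.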

You can visualize the design as: class CSV files at the local location → HDFS input location → Pig (second-topper calculation) → HDFS output location → Hive table.

The great news is that we have made a good start: we have designed the main part of the project, the blueprint. Looking at it, I can say that we need one Pig script to transform the data; let’s call it calToppers.pig.

Since we will store our output data at an HDFS location, we need an HQL script that creates a Hive layer on top of it; call it SecToppers.hql. And for step-by-step execution of each process in the blueprint, we need a shell script; name it runApp.sh.

Things become easier when we look at pictures, so take a look at the picture below, which will help you visualize the design.

1. runApp.sh

Let’s start writing runApp.sh

Define the following variables for the application:
1. localLocation for the local location of the application.
2. hdfsLocation for the HDFS location of the application.
3. inputLocation for the location of the input files in HDFS.
4. outputLocation for storing the output data in HDFS.

Define the following processes:
1. Create the HDFS location of the project.
2. Create the input location of the project.
3. Put the input data from the local location into the HDFS location.
4. Call the Pig script to calculate the second toppers.
5. Call the Hive script to create a Hive table on top of the output data.

Here is the code, ready for you.

 
 
  1. #!/bin/bash
  2. #BIGDATAPROGR/\MMERS
  3. #http://bigdataprogrammers.com/
  4. # PROJECT : LEVEL 1 : secondTopperOfSchool
  5. #To Find Second Toppers of School using Apache Pig.
  6. #HDFS Location of project is :- hdfs://sandbox.hortonworks.com:8020/user/root/bdp/projects/secondTopperOfSchool
  7. #Local Location of Project is :- /root/local_bdp/projects/level1/secondTopperOfSchool
  8. ######################################################
  9. # Deployment Script of  secondTopperOfSchool.
  10. #To set Initial Directories. (Change these two locations as per your environment.)
  11. export hdfsLocation=hdfs://sandbox.hortonworks.com:8020/user/root/bdp/projects/secondTopperOfSchool
  12. export localLocation=/root/local_bdp/projects/level1/secondTopperOfSchool
  13. echo "Starting The Application....."
  14. # Create hdfs directory for the project
  15. echo "Creating Hdfs directory...."
  16. hadoop fs -mkdir $hdfsLocation
  17. #To Create input location of files in hdfs.
  18. export inputLocation=$hdfsLocation/ip
  19. hadoop fs -mkdir $inputLocation
  20. #Put input files into Hdfs location
  21. echo "Transferring input files to hdfs location..."
  22. hadoop fs -put $localLocation/input_files/* $inputLocation/
  23. #call Pig Script which will calculate second toppers and store the results in to output hdfs directory.
  24. export outputLocation=$hdfsLocation/op
  25. echo "Calculating second topper of all class...."
  26. pig -f $localLocation/calToppers.pig -param inputLocation=$inputLocation -param outputLocation=$outputLocation
  27. # Create a hive layer on top of output Data.
  28. echo "storing results in to hive table ..."
  29. hive -f "$localLocation/SecToppers.hql" -hiveconf outputLocation=$outputLocation
  30. echo "please view results in SecToppers table of hive."
  31. echo "Process Completed."
  32. echo "Done:BIGDATAPROGR/\MMERS"

The shell script above controls the flow of the project. As you can see, we are calling one Pig script and one Hive script, so let’s take a look at what’s inside these scripts.

2. calToppers.pig

 
  CLASS_RESULTS = LOAD '$inputLocation/*' using PigStorage(',') AS (name:chararray,percentage:float,class:int);
  GROUPD_CLASS_RESULTS = GROUP CLASS_RESULTS BY class;
  SECOND_TOPPERS = FOREACH GROUPD_CLASS_RESULTS {
      SORTED1 = ORDER CLASS_RESULTS BY percentage DESC;
      LIMITED = LIMIT SORTED1 2;
      SORTED2 = ORDER LIMITED BY percentage ASC;
      SEC_TOPPER = LIMIT SORTED2 1;
      GENERATE FLATTEN(SEC_TOPPER);
  };
  STORE SECOND_TOPPERS INTO '$outputLocation' using PigStorage(',');

This script contains the main logic to determine the second topper of each class.
Let’s go step by step.

1. Load the data of all classes into Pig:
We have the data at one HDFS location, defined as $inputLocation in the shell script. As mentioned earlier, we have CSV files such as class1.csv, class2.csv, etc., and we have to load all of them into one relation. I am using a single relation, CLASS_RESULTS, and reading all the files with the /* wildcard.

Below is the command for loading the data.

 
 
  CLASS_RESULTS = LOAD '$inputLocation/*' using PigStorage(',') AS (name:chararray,percentage:float,class:int);

2. Now this relation holds the data of all classes, but we want to get the topper’s name class-wise, so without a second thought we can group this relation by class.

 
 
  GROUPD_CLASS_RESULTS = GROUP CLASS_RESULTS BY class;

3. This step can be divided into four sub-steps. The outer relation is named SECOND_TOPPERS, but it contains four inner relations:
SORTED1, LIMITED, SORTED2, SEC_TOPPER

We can understand these steps with the following logic.

Suppose you have an array and you want to find the second maximum; the logic would be:

Array – [9, 3, 6, 1, 5, 2, 8, 7, 4]

A. Sort the array in descending order.

[9, 8, 7, 6, 5, 4, 3, 2, 1]
See SORTED1 in the command.

B. Keep only the first two elements, i.e. indexes 0 and 1.

[9, 8]
See LIMITED in the command.

C. Sort the remaining elements in ascending order.

[8, 9]
See SORTED2 in the command.

D. Limit the array to one element.

[8], and that is the second maximum.

See SEC_TOPPER in the command.
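The four sub-steps above map neatly onto a shell pipeline; here is a minimal sketch using the sample array, where sort and head stand in for Pig's ORDER and LIMIT:

```shell
#!/bin/bash
# Second maximum of the sample array, mirroring the Pig relations:
#   sort -nr  -> SORTED1 (sort descending)
#   head -n 2 -> LIMITED (keep the top two)
#   sort -n   -> SORTED2 (sort ascending)
#   head -n 1 -> SEC_TOPPER (keep one element = second maximum)
printf '%s\n' 9 3 6 1 5 2 8 7 4 | sort -nr | head -n 2 | sort -n | head -n 1
# prints 8
```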

We have used similar logic to identify the second toppers, sorting by percentage. Finally, we FLATTEN the records because we want flat records, not bags, in the output.

 
 
  SECOND_TOPPERS = FOREACH GROUPD_CLASS_RESULTS {
      SORTED1 = ORDER CLASS_RESULTS BY percentage DESC;
      LIMITED = LIMIT SORTED1 2;
      SORTED2 = ORDER LIMITED BY percentage ASC;
      SEC_TOPPER = LIMIT SORTED2 1;
      GENERATE FLATTEN(SEC_TOPPER);
  };

4. Now we can store the output at the output location defined in the shell script.

 
 
  STORE SECOND_TOPPERS INTO '$outputLocation' using PigStorage(',');

As we use two variables, $inputLocation and $outputLocation, in the Pig script, we have to pass both of them while calling it. Refer to line 26 of the shell script.

3. SecToppers.hql

 
  CREATE SCHEMA IF NOT EXISTS bdp;
  CREATE EXTERNAL TABLE IF NOT EXISTS bdp.SecToppers
  (name STRING,
  percentage FLOAT,
  class INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '${hiveconf:outputLocation}';

The STORE command at the end of the Pig script writes the data to the output location, and we then create a Hive layer on top of it. To create it, run the HQL script SecToppers.hql, which expects one hiveconf argument, outputLocation.

Setup

Download the complete project from here.

Change the local and HDFS locations in the shell script runApp.sh.

Execution

To run the complete application, we just need to run the shell script, which takes care of executing the other processes.
Command to run the script:

 
 
  sh <location of your project>/secondTopperOfSchool/runApp.sh

Look at the screenshot below for reference.

Once the execution is completed, you will see the messages below.

Result

To see the output, go to the Hive prompt and enter the query below.

 
 
  select * from bdp.sectoppers;

It will show the second toppers of the school with their percentage and class. Below is a screenshot for reference.

Don’t forget to subscribe for more tiny projects. Keep learning!