Calculate percentage in spark using scala

Requirement

You have the marks of all the students of a class, along with their roll numbers, in a CSV file. You need to calculate the percentage of each student in Spark using Scala.

Given:

Download the sample CSV file marks.csv, which has 7 columns: the first column is the roll number and the other 6 columns are subject1, subject2, …, subject6.

Solution

Step 1: Load the sample CSV file marks.csv into HDFS

I have a local directory named “calculate_percetage_in_spark_using_scala” at the path “/root/local_bdp/problems/”, so I have kept the marks.csv file there.

You can see the sample data in the screenshot below:

Let’s create the HDFS directory using the command below:

hadoop fs -mkdir -p hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/

As you can see, an “ip” directory is created for the input files.
Now we can load the file into HDFS using the command below:

hadoop fs -put /root/local_bdp/problems/calculate_percetage_in_spark_using_scala/marks.csv hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/

 

Step 2: Start the Spark Shell

Now it’s time to interact with the Scala shell. Enter the command below:

spark-shell

It will take you to the Scala shell.
Create an RDD named ip_marks that holds all the data from the file marks.csv. Since we have kept the file in HDFS, pass this HDFS location as the argument to textFile.
Use the command below:

var ip_marks=sc.textFile("hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/marks.csv")

 


To see the data of this RDD, use the command below:

ip_marks.collect()

Refer to the screenshot below.

Step 3: Create a Pair RDD

It’s time to create a pair RDD where the roll number is the key and the list of a student’s marks is the value.
As you can see in the screenshot above, each line of the file is a single String, so we need to split it on a comma (,), since we have a comma-separated file.
Use the command below to create a new RDD named list_mrks:

var list_mrks=ip_marks.map(x => (x.split(',')(0),List(x.split(',')(1).toFloat,x.split(',')(2).toFloat,x.split(',')(3).toFloat,x.split(',')(4).toFloat,x.split(',')(5).toFloat,x.split(',')(6).toFloat)))

The above command does the following:

1. Splits each line on a comma.
2. Converts each mark to a float using toFloat.
3. Creates a List of the marks of the 6 subjects.
4. Takes the first value (index 0) as the roll number.
5. Creates a pair RDD where the roll number is the key and the List is the value.
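As a side note, the command above calls split(',') seven times on every line. The same parsing can be done in one pass by splitting each line once and reusing the columns. The sketch below shows that logic as a plain Scala function; parseLine is a hypothetical helper name, not part of the original commands:

```scala
// Split each line once and reuse the resulting columns.
// Returns (roll number, list of the 6 subject marks).
def parseLine(line: String): (String, List[Float]) = {
  val cols = line.split(',')                      // split only once
  val rollNo = cols(0)                            // first column is the roll number
  val marks = cols.drop(1).map(_.toFloat).toList  // remaining columns are the marks
  (rollNo, marks)
}

// In the shell, the same RDD could then be built as:
// var list_mrks = ip_marks.map(parseLine)
```

This produces the same pair RDD as the original command, just with a single split per line.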

To see the data of this RDD, use the command below:

list_mrks.collect()

Step 4: Calculate the Percentage

Use the command below to calculate the percentage:

var per_mrks=list_mrks.mapValues(x => x.sum/x.length)

In the command above, the mapValues function is used to perform an operation on the values without altering the keys. To calculate the percentage, we use two functions of a List: sum and length. We have assumed that the total marks of each subject are 100; you can change the formula if you wish.
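Since each subject is assumed to be out of 100, the average of the 6 marks is itself the percentage. The formula that mapValues applies to each value can be checked on a plain Scala List (the sample marks below are hypothetical):

```scala
// The same formula mapValues applies to each value:
// sum of all marks divided by the number of subjects.
def percentage(marks: List[Float]): Float = marks.sum / marks.length

val sample = List(80f, 90f, 70f, 60f, 85f, 95f) // hypothetical marks for one student
val per = percentage(sample)                    // (80+90+70+60+85+95) / 6 = 80.0
```

Note that marks is a List[Float], so marks.sum is a Float and the division is floating-point; with a List[Int] the same expression would silently do integer division.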

Step 5: Output

Use the command below to see the data of the RDD per_mrks:

per_mrks.collect()

To save the data to a file, use the command below:

per_mrks.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/op")

It will create an “op” directory, which must not already exist. You can see the contents of the output files using the command below:

hadoop fs -cat hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/op/*
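saveAsTextFile writes each pair using its default tuple toString, so the output lines look like (101,80.0). If you would rather have plain CSV output, one option (not part of the original tutorial) is to format each pair as a string before saving; the formatting logic can be sketched as:

```scala
// Format a (roll number, percentage) pair as a CSV line.
// toCsvLine is a hypothetical helper name for illustration.
def toCsvLine(pair: (String, Float)): String = {
  val (rollNo, per) = pair
  s"$rollNo,$per"
}

// In the shell, this would be applied before saving, e.g.:
// per_mrks.map(toCsvLine).saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/op")
```

The output directory rule is the same either way: the target path must not already exist.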

Please refer to the screenshot below.

Don’t forget to subscribe us. Keep learning. Keep sharing.
