Calculate percentage in spark using scala

Requirement

You have the marks of all the students of a class, along with their roll numbers, in a CSV file. The task is to calculate the percentage of each student in Spark using Scala.

Given:

Download the sample CSV file marks.csv, which has 7 columns: the 1st column is the roll number and the other 6 are subject1, subject2, …, subject6.
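For illustration, one row of marks.csv might look like the following (the values are hypothetical; the first field is the roll number and the remaining six are the marks for subject1 … subject6, each assumed to be out of 100):

```
101,70,80,90,60,50,40
```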

Solution

Step 1: Loading the sample CSV file marks.csv into HDFS

I have a local directory named “calculate_percetage_in_spark_using_scala” under the path “/root/local_bdp/problems/”, so I have kept the marks.csv file in that directory.

You can see the sample data in the screenshot below:

Let’s create the HDFS directory using the command below:

  hadoop fs -mkdir -p hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/

As you can see, the “ip” directory is created for the input files.
Now we can load the file into HDFS using the command below:

  hadoop fs -put /root/local_bdp/problems/calculate_percetage_in_spark_using_scala/marks.csv hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/

 

Step 2: Launching the Spark shell

Now it’s time to interact with the Scala shell.
Enter the command below:

  spark-shell

It will take you to the Scala shell.
Create an RDD named ip_marks that will hold all the data from the file marks.csv. As we have kept the file in HDFS, pass this HDFS location as the argument to textFile.
Use the command below:

  var ip_marks = sc.textFile("hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/marks.csv")

 


To see the data of this RDD, use the command below:

  ip_marks.collect()

Refer to the screenshot below.

Step 3: Creation of a Pair RDD

It’s time to create a pair RDD where the roll number is the key and the list of a student’s marks is the value.
As you can see in the screenshot above, each line of the file is a single String, so we need to split it on commas, since the file is comma-separated.
Use the command below to create a new RDD, list_mrks:

  var list_mrks = ip_marks.map(x => (x.split(',')(0), List(x.split(',')(1).toFloat, x.split(',')(2).toFloat, x.split(',')(3).toFloat, x.split(',')(4).toFloat, x.split(',')(5).toFloat, x.split(',')(6).toFloat)))

The above command does the following:

1. Splits each line on commas.
2. Converts each mark to a Float using toFloat.
3. Creates a List of the marks of the 6 subjects.
4. Takes the first value as the roll number, hence index 0.
5. Creates a pair RDD where the roll number is the key and the List is the value.
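The same parsing logic can be sketched in plain Scala outside Spark, splitting each line only once instead of seven times; the sample row below is hypothetical:

```scala
// Plain-Scala sketch of the Step 3 parsing logic (no Spark required).
// Splits the CSV line once, takes index 0 as the roll number (key)
// and converts the remaining six fields to Float (value).
object ParseSketch {
  def parseLine(line: String): (String, List[Float]) = {
    val cols = line.split(',')                     // split once on commas
    (cols(0), cols.drop(1).map(_.toFloat).toList)  // (roll no, list of marks)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical row in the same shape as marks.csv
    println(parseLine("101,70,80,90,60,50,40"))
    // prints (101,List(70.0, 80.0, 90.0, 60.0, 50.0, 40.0))
  }
}
```

Inside Spark, the single-split version could be used in the same map call: ip_marks.map { x => val cols = x.split(','); (cols(0), cols.drop(1).map(_.toFloat).toList) }.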

To see the data of this RDD, use the command below:

  list_mrks.collect()

Step 4: Calculation of Percentage

Use the command below to calculate the percentage:

  var per_mrks=list_mrks.mapValues(x => x.sum/x.length)

In the above command, the mapValues function is used to perform an operation on the values without altering the keys. To calculate the percentage, we use two functions of a List: sum and length. We have assumed that the total marks for each subject are 100, so the average of the six marks is itself the percentage. You can change the formula if you wish.
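The percentage step can be checked in plain Scala; the marks below are hypothetical, and percentageOutOf is one way the formula could be generalised if the maximum mark per subject were not 100:

```scala
// Sketch of the Step 4 calculation (no Spark required).
object PercentageSketch {
  // Each subject is out of 100, so the average of the marks is the percentage.
  def percentage(marks: List[Float]): Float = marks.sum / marks.length

  // Generalised form for a different maximum mark per subject.
  def percentageOutOf(marks: List[Float], maxMark: Float): Float =
    marks.sum / (marks.length * maxMark) * 100

  def main(args: Array[String]): Unit = {
    val marks = List(70f, 80f, 90f, 60f, 50f, 40f)  // hypothetical marks
    println(percentage(marks))  // prints 65.0
  }
}
```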

Step 5: Output

Use the command below to see the data of the RDD per_mrks:

  per_mrks.collect()

To save the data into a file, use the command below:

  per_mrks.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/op")

It will create an “op” directory, which must not already exist. You can see the content of the output file using the command below:

  hadoop fs -cat hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/op/*

Please refer to the screenshot below.

Don’t forget to subscribe us. Keep learning. Keep sharing.
