Calculate percentage in Spark using Scala

Requirement

You have the marks of all the students of a class, along with their roll numbers, in a CSV file. You need to calculate the percentage of each student in Spark using Scala.

Given :

Download the sample CSV file marks, which has 7 columns: the 1st column is Roll no and the other 6 are subject1, subject2, …, subject6.
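For reference, the file looks something like this (the two rows below are made-up examples just to show the shape; your downloaded file will have its own values):

```
101,70,80,90,60,50,40
102,55,65,75,85,95,45
```

Note that there is no header row; each line is a roll number followed by six marks.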

Solution

Step 1: Loading the sample CSV file marks.csv into HDFS

I have a local directory named “calculate_percetage_in_spark_using_scala” in the path “/root/local_bdp/problems/”,
so I have kept the marks.csv file there.

You can see the sample data in the screenshot below:

Let’s create an HDFS directory using the command below:

  hadoop fs -mkdir -p hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/

As you can see, an “ip” directory is created for the input files.
Now we can load the file into HDFS using the command below:

  hadoop fs -put /root/local_bdp/problems/calculate_percetage_in_spark_using_scala/marks.csv hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/

 

Step 2: Launching the Spark shell

Now it’s time to interact with the Scala shell.
Enter the command below:

  spark-shell

It will take you to the Scala shell.
Create an RDD named ip_marks which will hold all the data from the file marks.csv. Since we have kept the file in an HDFS location, pass that HDFS path as the argument to textFile.
Use the command below:

  val ip_marks = sc.textFile("hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/ip/marks.csv")

 


To see the data of this RDD, use the command below:

  ip_marks.collect()

Refer to the screenshot below.

Step 3: Creation of a Pair RDD

It’s time to create a pair RDD where the roll no is the key and the list of a student’s marks is the value.
As you can see in the above screenshot, each line of the file is a single String, so we need to split it on commas, since the file is comma separated.
Use the command below to create a new RDD, list_mrks:

  val list_mrks = ip_marks.map { x =>
    val cols = x.split(',')
    (cols(0), List(cols(1).toFloat, cols(2).toFloat, cols(3).toFloat, cols(4).toFloat, cols(5).toFloat, cols(6).toFloat))
  }

The above command does the following:

1. Splits each line on commas.
2. Converts each mark to a Float using toFloat.
3. Creates a List of the marks of the 6 subjects.
4. Takes the first value (index 0) as the roll no.
5. Creates a pair RDD where the roll no is the key and the List is the value.
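To see what this per-line transformation does, here is a minimal plain-Scala sketch that does not need Spark at all (the sample row is made up for illustration):

```scala
// A hypothetical CSV row in the same shape as marks.csv: roll no, then 6 marks.
val line = "101,70,80,90,60,50,40"

// Split once and reuse the result, instead of calling split for every column.
val cols = line.split(',')

// Pair of (roll no, marks as Floats) -- the element type of list_mrks.
val pair = (cols(0), cols.drop(1).map(_.toFloat).toList)

println(pair)  // (101,List(70.0, 80.0, 90.0, 60.0, 50.0, 40.0))
```

The map function inside Spark applies exactly this kind of transformation to every line of the RDD.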

To see the data of this RDD, use the command below:

  list_mrks.collect()

Step 4: Calculation of the Percentage

Use the command below to calculate the percentage:

  val per_mrks = list_mrks.mapValues(x => x.sum / x.length)

In the above command, the mapValues function is used to perform an operation on the values without altering the keys. To calculate the percentage we use two List functions, sum and length. We have assumed that the total marks of each subject are 100, so the average of the six marks is also the percentage. You can change the formula if you wish.
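The formula itself is plain Scala, so it can be checked outside Spark. A small sketch, under the same assumption that each subject is out of 100 (the marks below are made-up examples):

```scala
// Hypothetical list of 6 subject marks, each out of 100.
val marks = List(70f, 80f, 90f, 60f, 50f, 40f)

// sum / length gives the average, which equals the percentage
// when every subject is marked out of 100.
val percentage = marks.sum / marks.length  // 390.0 / 6 = 65.0
```

If the maximum marks differed per subject, you would instead divide the total obtained by the total maximum and multiply by 100.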

Step 5: Output

Use the command below to see the data of the RDD per_mrks:

  per_mrks.collect()

To save the data to a file, use the command below:

  per_mrks.saveAsTextFile("hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/op")

It will create an “op” directory, which must not already exist. You can see the content of the output files using the command below:

  hadoop fs -cat hdfs://sandbox.hortonworks.com:8020/user/root/bdp/problems/cal_per_sprk/op/*

Please refer to the screenshot below.

Don’t forget to subscribe. Keep learning. Keep sharing.
