How to calculate rank in a dataframe using Python (with example)

Requirement:

You have the marks of all the students of a class, and you want to find the rank of each student using Python.

Given:

A pipe-separated file which contains the roll number and marks of each student. Below are sample values:

R_no|marks
101|389
102|412
103|435

Please download the sample file marks to follow along.

Solution:

Step 1: Loading the raw data into a Hive table

As you can see, our raw data is in a pipe-separated file. I am keeping the raw file in the class8 directory, which I have just created. Use the below commands:

put file into local directory
 
cd bdp/projects/
mkdir class8
cd class8
# copy the downloaded marks.txt into this directory, then verify its contents
cat marks.txt

Now I am putting this file into HDFS, after first creating a stu_marks directory there. Use the below commands:

put local file into hdfs
 
hadoop fs -mkdir bdps/stu_marks
hadoop fs -put marks.txt bdps/stu_marks/
hadoop fs -ls bdps/stu_marks/

Now we will load this file into a Hive table. Please refer to this post if you have trouble loading a file into a Hive table. You need to give pipe (|) as the delimiter.

I have created a Hive table named “class8_marks” (under the bdp schema) on top of this data. It has two columns, both of INT type.

Please use the below commands one by one to create the Hive table on top of the data:

Create hive table
 
CREATE SCHEMA IF NOT EXISTS bdp;
CREATE EXTERNAL TABLE IF NOT EXISTS bdp.class8_marks
(roll_no INT,
ttl_marks INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/user/root/bdps/stu_marks';

As you can see above, the LOCATION I gave here is the HDFS path where the file is present (the hostname and port will depend on your cluster).
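If you want to sanity-check the table, you can run a quick query from the pyspark shell (opened in Step 2 below). This is a minimal sketch using the table name from the DDL above:

# read the newly created Hive table and preview a few rows
spark.sql("SELECT * FROM bdp.class8_marks LIMIT 5").show()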

Step 2: Loading the Hive table into Spark using Python

First, open the pyspark shell using the below command:

pyspark shell
 
pyspark

Once the CLI is open, use the below command to load the Hive table:

Load hive table in spark
 
stu_marks=spark.table("bdp.class8_marks")

Here, stu_marks is the DataFrame that contains the data of the Hive table. You can see the data using the show command:

to see the data
 
stu_marks.show()

Step 3: Assign ranks to the students

Now let’s come to the actual logic to find the rank of the students.

Please use the below commands to import the required functions:

imports
 
from pyspark.sql.functions import rank, col
from pyspark.sql.window import Window

assigning ranks
 
ranked = stu_marks.withColumn("rank", rank().over(Window.orderBy(col("ttl_marks").desc())))

In the above command, I am using the rank function over a window ordered by ttl_marks. As we want rank 1 to go to the student with the highest marks, we order in descending order. Note that rank() gives students with equal marks the same rank and then skips the following positions; if you do not want gaps after ties, dense_rank() can be used instead, as sketched below.
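A minimal sketch of that variant, assuming you have already run the imports above:

from pyspark.sql.functions import dense_rank

# same window, but tied marks do not create gaps in the ranking
dense_ranked = stu_marks.withColumn("rank", dense_rank().over(Window.orderBy(col("ttl_marks").desc())))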

The below command will give you the expected result, in which a rank is assigned against each roll number. For the three sample rows, roll number 103 (435 marks) gets rank 1, 102 gets rank 2, and 101 gets rank 3.

to see the final output
 
ranked.show()

You can save the result as per your requirements.
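For example, here is a minimal sketch of writing the ranked DataFrame back to HDFS as pipe-delimited CSV; the output path bdps/stu_marks_ranked is just an example, not from the original post:

# write the ranked result as pipe-delimited text (example path)
ranked.write.mode("overwrite").option("sep", "|").csv("bdps/stu_marks_ranked")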

Wrapping Up:

Here we have understood how the rank function works and a simple use case for it. We can assign ranks partition-wise as well; for that, you have to use partitionBy in the window definition (the equivalent of PARTITION BY in an OVER clause).
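As a minimal sketch, assuming the data also had a section column to partition on (our sample file does not have one), partition-wise ranking would look like this:

# rank students within each section separately
# ("section" is a hypothetical column, not present in marks.txt)
w = Window.partitionBy("section").orderBy(col("ttl_marks").desc())
ranked_by_section = stu_marks.withColumn("rank", rank().over(w))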

Rank can also be used if you want to find the result of the nth rank holder: you can simply filter on the required rank, as shown below.
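Here is a minimal sketch, reusing the ranked DataFrame and the col import from above, to fetch the second rank holder:

# fetch the student(s) holding rank 2
ranked.filter(col("rank") == 2).show()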

If you are looking for the same code in Scala instead of Python, please read this blog post. And don’t forget to subscribe to our blog.
