How to get partition record in Spark Using Scala

How to get partition record in Spark Using Scala


Suppose we are having a text format data file which contains employees basic details. When we load this file in Spark, it returns an RDD. Our requirement is to find the number of partitions which has created just after loading the data file and see what records are stored in each partition.

Components Involved

  • Spark
  • Hdfs
  • Scala

Sample Data

Let’s take a sample of few records for the demonstration:

  1. empno,ename,designation,manager,hire_date,sal,deptno
  2. 7369,SMITH,CLERK,9902,2010-12-17,800.00,20
  3. 7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30
  4. 7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30
  5. 9566,JONES,MANAGER,9839,2011-04-02,2975.00,20
  6. 7654,MARTIN,SALESMAN,9698,2011-09-28,1250.00,30
  7. 9698,BLAKE,MANAGER,9839,2011-05-01,2850.00,30
  8. 9782,CLARK,MANAGER,9839,2011-06-09,2450.00,10
  9. 9788,SCOTT,ANALYST,9566,2012-12-09,3000.00,20
  10. 9839,KING,PRESIDENT,NULL,2011-11-17,5000.00,10
  11. 7844,TURNER,SALESMAN,9698,2011-09-08,1500.00,30
  12. 7876,ADAMS,CLERK,9788,2012-01-12,1100.00,20
  13. 7900,JAMES,CLERK,9698,2011-12-03,950.00,30
  14. 9902,FORD,ANALYST,9566,2011-12-03,3000.00,20
  15. 7934,MILLER,CLERK,9782,2012-01-23,1300.00,10

You can download the sample data from here


Step 1: Data Preparation

The first step is data preparation. We will use the same data which has been shared above. You can keep this data either in local or HDFS. We will see both the cases. So in my case, I have above sample data (emp_data.txt) in local at /home/anoo17/Spark-With-Scala/resource and in HDFS at /home/anoo17/Spark-With-Scala/resource.

Step 2: Start spark-shell

Open the spark shell. It will create spark context as sc.

Step 3: Load the data file

Now sc is available as Spark Context. Let’s first load local data file using it. It will return an RDD.

Load Text File
  1. val emp_data = sc.textFile("file:///home/anoo17/Spark-With-Scala/resource/emp_data.txt")

Now, let’s load hdfs data file:

Load HDFS Text File
  1. val emp_hdfs_data = sc.textFile("hdfs:///user/anoo17/Spark-With-Scala/resource/emp_data.txt")

Here, for loading data file from local and hdfs have only 1 difference which is file/hdfs in the path. So, if the path contains file keyword, it means you are loading data from the local path.

textFile – It is used to load data. It takes two arguments. One is the path of the file and other is optional which is the no. of partitions given by a user. As we have not provided any value for the second parameter. It will create the partition by itself.

Step 4: Count number of Partitions

We can check the no. of partitions created while loading data by using partitions.size function on the RDD.

count partition
  1. emp_data.partitions.size

Step 5: Read data from a specific Partition

Let’s read the records from the partition.
Partition Index = 0

records for partition 0
  1. emp_data.mapPartitionsWithIndex( (index: Int, it: Iterator[String]) => => if (index ==0) {println(x)}).iterator).collect()

Partition Index = 1

records for partition 1
  1. emp_data.mapPartitionsWithIndex( (index: Int, it: Iterator[String]) => => if (index ==1) {println(x)}).iterator).collect()

mapPartitionsWithIndex – This function will iterate all the partitions while tracking the index of the original partition. We will assign index value of the partition we want to read records. The index value is start with 0. So assigned index value as 0 for 1st partition records.

collect() – It will return an array that contains all of the elements of RDD.
Let’s load data by providing a specific number of partitions. So, we need to provide a value in the second parameter.

load file using user partition
  1. val emp_data= sc.textFile("file:///home/anoo17/Spark-With-Scala/resource/emp_data.txt", 1)

Here I have given 1 partition. So it will create only 1 partition in the RDD.

Read the records from the partitions:

We are seeing all the records of the file have been stored into 1 partition.

Wrapping Up

In this post, we have gone through the basic understanding of loading data files from local and hdfs into an RDD, count number of partitions of RDD and read records of a specific partition. In addition to this we have also seen how can we specify the number of the partition while loading data. You can try on the different set of data to know more about the number of partitions.


Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.