How to Get Partition Records in Spark Using Scala

Requirement

Suppose we have a text-format data file that contains employees' basic details. When we load this file in Spark, it returns an RDD. Our requirement is to find the number of partitions created just after loading the data file and to see which records are stored in each partition.

Components Involved

  • Spark
  • HDFS
  • Scala

Sample Data

Let’s take a sample of a few records for the demonstration:

emp_data
 
  empno,ename,designation,manager,hire_date,sal,deptno
  7369,SMITH,CLERK,9902,2010-12-17,800.00,20
  7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30
  7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30
  9566,JONES,MANAGER,9839,2011-04-02,2975.00,20
  7654,MARTIN,SALESMAN,9698,2011-09-28,1250.00,30
  9698,BLAKE,MANAGER,9839,2011-05-01,2850.00,30
  9782,CLARK,MANAGER,9839,2011-06-09,2450.00,10
  9788,SCOTT,ANALYST,9566,2012-12-09,3000.00,20
  9839,KING,PRESIDENT,NULL,2011-11-17,5000.00,10
  7844,TURNER,SALESMAN,9698,2011-09-08,1500.00,30
  7876,ADAMS,CLERK,9788,2012-01-12,1100.00,20
  7900,JAMES,CLERK,9698,2011-12-03,950.00,30
  9902,FORD,ANALYST,9566,2011-12-03,3000.00,20
  7934,MILLER,CLERK,9782,2012-01-23,1300.00,10

You can download the sample data from here

Solution

Step 1: Data Preparation

The first step is data preparation. We will use the same data that has been shared above. You can keep this data either on the local file system or in HDFS; we will see both cases. In my case, I have the above sample data (emp_data.txt) locally at /home/anoo17/Spark-With-Scala/resource and in HDFS at /user/anoo17/Spark-With-Scala/resource.
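
If the file exists only locally so far, one simple way to copy it into HDFS (assuming the local and HDFS paths above, and that the hdfs command is available) is:

copy file to HDFS
 
  hdfs dfs -mkdir -p /user/anoo17/Spark-With-Scala/resource
  hdfs dfs -put /home/anoo17/Spark-With-Scala/resource/emp_data.txt /user/anoo17/Spark-With-Scala/resource/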

Step 2: Start spark-shell

Open the Spark shell. It will create the Spark context and expose it as sc.
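
A minimal way to start it from a terminal (assuming spark-shell is on the PATH):

start spark-shell
 
  spark-shell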

Step 3: Load the data file

Now sc is available as the Spark context. Let’s first load the local data file using it. It will return an RDD.

Load Text File
 
  val emp_data = sc.textFile("file:///home/anoo17/Spark-With-Scala/resource/emp_data.txt")

Now, let’s load the data file from HDFS:

Load HDFS Text File
 
  val emp_hdfs_data = sc.textFile("hdfs:///user/anoo17/Spark-With-Scala/resource/emp_data.txt")

Here, loading the data file from local and from HDFS differs in only one thing: the file:// versus hdfs:// scheme in the path. So, if the path starts with the file scheme, it means you are loading data from the local file system.

textFile – It is used to load data. It takes two arguments: the path of the file and an optional second argument, the number of partitions requested by the user. As we have not provided any value for the second parameter, Spark decides the number of partitions by itself.

Step 4: Count number of Partitions

We can check the number of partitions created while loading the data by calling partitions.size on the RDD.

count partition
 
  emp_data.partitions.size
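
Equivalently, recent Spark versions also expose getNumPartitions on the RDD, which returns the same count:

count partition (alternative)
 
  emp_data.getNumPartitions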

Step 5: Read data from a specific Partition

Let’s read the records from each partition.
Partition Index = 0

records for partition 0
 
  emp_data.mapPartitionsWithIndex((index: Int, it: Iterator[String]) => it.toList.map(x => if (index == 0) { println(x) }).iterator).collect()

Partition Index = 1

records for partition 1
 
  emp_data.mapPartitionsWithIndex((index: Int, it: Iterator[String]) => it.toList.map(x => if (index == 1) { println(x) }).iterator).collect()

mapPartitionsWithIndex – This function iterates over all the partitions while tracking the index of the original partition. Inside it we compare against the index of the partition whose records we want to read. The index starts at 0, so we used index 0 to read the records of the first partition.

collect() – It returns an array that contains all of the elements of the RDD.
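
If you prefer to get the records back rather than print them, a small sketch (reusing the emp_data RDD from above; the byPartition name is only illustrative) is to tag every record with its partition index and then collect the pairs:

read records with partition index
 
  // Tag each record with the index of the partition it belongs to.
  val byPartition = emp_data
    .mapPartitionsWithIndex((index, it) => it.map(rec => (index, rec)))
    .collect()
 
  // Print only the records that belong to partition 0.
  byPartition.filter(_._1 == 0).foreach { case (_, rec) => println(rec) }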

Let’s now load the data by providing a specific number of partitions. To do so, we pass a value as the second argument.

load file using user partition
 
  val emp_data = sc.textFile("file:///home/anoo17/Spark-With-Scala/resource/emp_data.txt", 1)

Here I have asked for 1 partition, so the RDD will be created with only 1 partition.

Read the records from the partitions:
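
The same command as before, with partition index 0, now prints every row of the file:

records for partition 0
 
  emp_data.mapPartitionsWithIndex((index: Int, it: Iterator[String]) => it.toList.map(x => if (index == 0) { println(x) }).iterator).collect()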


We can see that all the records of the file have been stored in 1 partition.

Wrapping Up

In this post, we have gone through the basics of loading data files from local and HDFS into an RDD, counting the number of partitions of an RDD, and reading the records of a specific partition. In addition, we have also seen how to specify the number of partitions while loading data. You can try this on different sets of data to learn more about how the number of partitions is decided.
