How to Get Partition Records in Spark Using Scala

Requirement

Suppose we have a text-format data file that contains employees' basic details. When we load this file in Spark, it returns an RDD. Our requirement is to find the number of partitions created just after loading the data file and to see which records are stored in each partition.

Components Involved

  • Spark
  • HDFS
  • Scala

Sample Data

Let’s take a sample of a few records for the demonstration:

emp_data
 
  empno,ename,designation,manager,hire_date,sal,deptno
  7369,SMITH,CLERK,9902,2010-12-17,800.00,20
  7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30
  7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30
  9566,JONES,MANAGER,9839,2011-04-02,2975.00,20
  7654,MARTIN,SALESMAN,9698,2011-09-28,1250.00,30
  9698,BLAKE,MANAGER,9839,2011-05-01,2850.00,30
  9782,CLARK,MANAGER,9839,2011-06-09,2450.00,10
  9788,SCOTT,ANALYST,9566,2012-12-09,3000.00,20
  9839,KING,PRESIDENT,NULL,2011-11-17,5000.00,10
  7844,TURNER,SALESMAN,9698,2011-09-08,1500.00,30
  7876,ADAMS,CLERK,9788,2012-01-12,1100.00,20
  7900,JAMES,CLERK,9698,2011-12-03,950.00,30
  9902,FORD,ANALYST,9566,2011-12-03,3000.00,20
  7934,MILLER,CLERK,9782,2012-01-23,1300.00,10

You can download the sample data from here

Solution

Step 1: Data Preparation

The first step is data preparation. We will use the same data that has been shared above. You can keep this data either on the local file system or in HDFS; we will see both cases. In my case, I have the above sample data (emp_data.txt) locally at /home/anoo17/Spark-With-Scala/resource and in HDFS at /user/anoo17/Spark-With-Scala/resource.
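
If the file exists only locally so far, one simple way to copy it into HDFS (assuming the local and HDFS paths above, and that the hdfs command is available) is:

copy file to HDFS
 
  hdfs dfs -mkdir -p /user/anoo17/Spark-With-Scala/resource
  hdfs dfs -put /home/anoo17/Spark-With-Scala/resource/emp_data.txt /user/anoo17/Spark-With-Scala/resource/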

Step 2: Start spark-shell

Open the Spark shell. It will create the Spark context and expose it as sc.
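
A minimal way to start it from a terminal (assuming spark-shell is on the PATH):

start spark-shell
 
  spark-shell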

Step 3: Load the data file

Now sc is available as the Spark context. Let’s first load the local data file using it. It will return an RDD.

Load Text File
 
  val emp_data = sc.textFile("file:///home/anoo17/Spark-With-Scala/resource/emp_data.txt")

Now, let’s load the data file from HDFS:

Load HDFS Text File
 
  val emp_hdfs_data = sc.textFile("hdfs:///user/anoo17/Spark-With-Scala/resource/emp_data.txt")

Here, loading the data file from local and from HDFS differs in only one thing: the file:// versus hdfs:// scheme in the path. So, if the path starts with the file scheme, it means you are loading data from the local file system.

textFile – It is used to load data. It takes two arguments: the path of the file and an optional second argument, the number of partitions requested by the user. As we have not provided any value for the second parameter, Spark decides the number of partitions by itself.

Step 4: Count number of Partitions

We can check the number of partitions created while loading the data by calling partitions.size on the RDD.

count partition
 
  emp_data.partitions.size
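
Equivalently, recent Spark versions also expose getNumPartitions on the RDD, which returns the same count:

count partition (alternative)
 
  emp_data.getNumPartitions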

Step 5: Read data from a specific Partition

Let’s read the records from each partition.
Partition Index = 0

records for partition 0
 
  emp_data.mapPartitionsWithIndex((index: Int, it: Iterator[String]) => it.toList.map(x => if (index == 0) { println(x) }).iterator).collect()

Partition Index = 1

records for partition 1
 
  emp_data.mapPartitionsWithIndex((index: Int, it: Iterator[String]) => it.toList.map(x => if (index == 1) { println(x) }).iterator).collect()

mapPartitionsWithIndex – This function iterates over all the partitions while tracking the index of the original partition. Inside it we compare against the index of the partition whose records we want to read. The index starts at 0, so we used index 0 to read the records of the first partition.

collect() – It returns an array that contains all of the elements of the RDD.
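
If you prefer to get the records back rather than print them, a small sketch (reusing the emp_data RDD from above; the byPartition name is only illustrative) is to tag every record with its partition index and then collect the pairs:

read records with partition index
 
  // Tag each record with the index of the partition it belongs to.
  val byPartition = emp_data
    .mapPartitionsWithIndex((index, it) => it.map(rec => (index, rec)))
    .collect()
 
  // Print only the records that belong to partition 0.
  byPartition.filter(_._1 == 0).foreach { case (_, rec) => println(rec) }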

Let’s now load the data by providing a specific number of partitions. To do so, we pass a value as the second argument.

load file using user partition
 
  val emp_data = sc.textFile("file:///home/anoo17/Spark-With-Scala/resource/emp_data.txt", 1)

Here I have asked for 1 partition, so the RDD will be created with only 1 partition.

Read the records from the partitions:
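
The same command as before, with partition index 0, now prints every row of the file:

records for partition 0
 
  emp_data.mapPartitionsWithIndex((index: Int, it: Iterator[String]) => it.toList.map(x => if (index == 0) { println(x) }).iterator).collect()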


We can see that all the records of the file have been stored in 1 partition.

Wrapping Up

In this post, we have gone through the basics of loading data files from local and HDFS into an RDD, counting the number of partitions of an RDD, and reading the records of a specific partition. In addition, we have also seen how to specify the number of partitions while loading data. You can try this on different sets of data to learn more about how the number of partitions is decided.
