Load Parquet Files into a Spark DataFrame Using Scala

Requirement :

You have Parquet file(s) present in an HDFS location, and you need to load the data into a Spark DataFrame.

Solution :

Step 1 : Input files (parquet format)

Here we assume you already have files in Parquet format in some HDFS directory. If you do not have Parquet files yet, please refer to this post to learn how to write data in Parquet format.
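Alternatively, if you just want sample files to test with, the sketch below writes a few made-up rows as Parquet from a Spark session (the sample data and the /tmp/sample1 path here are only assumptions for illustration):

 // Minimal sketch: write made-up sample rows in Parquet format.
 // In spark-shell, the spark session and its implicits are already in scope.
 import spark.implicits._
 val sample = Seq((1, "alice", 85.0), (2, "bob", 90.5)).toDF("id", "name", "percentage")
 sample.write.mode("overwrite").parquet("/tmp/sample1")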

For me, the files in Parquet format are available in the HDFS directory /tmp/sample1.

Use the command below to list all the Parquet files present in the HDFS location:

 hadoop fs -ls /tmp/sample1

Step 2 : Go to spark-shell

Now start the Spark shell using the command below:

 spark-shell

Make sure the user running spark-shell has at least read permission on those files.
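If permissions are missing, a command like the one below can grant read access (assuming your user is allowed to change permissions on that directory):

 hadoop fs -chmod -R o+rx /tmp/sample1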

Step 3.1 : Load into a DataFrame

Now we will load the files into a Spark DataFrame. Here we assume that all the files present in the directory have the same schema. That is, suppose you have three files in the directory, all with the schema [id int, name string, percentage double]. If there is a schema mismatch, you won't be able to load all the data this way (see Step 3.2).

Below is the code to load the data.

 val df = spark.read.parquet("/tmp/sample1")

Here we pass the HDFS directory itself, so all files in that directory are loaded.

Use df.show() to view the loaded data.
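For a quick sanity check after loading, standard DataFrame calls such as these work well:

 df.printSchema()   // print the schema Spark read from the Parquet footers
 df.show(5)         // preview the first five rows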

Step 3.2 : Merge schemas in case of multiple schemas

If you have multiple files with different schemas, then you need to set one extra option, mergeSchema, to true; see the code below:

 val df = spark.read.option("mergeSchema", true).format("parquet").load("/tmp/anydir/*")

*where /tmp/anydir contains multiple Parquet files with different schemas.

In this case the resulting DataFrame contains the union of the columns from all the schemas; for any given row, the columns that do not exist in that file's schema are filled with null values.
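To make the merge behavior concrete, here is a small hypothetical sketch (the paths and rows are made up) that writes two Parquet datasets with different schemas and reads them back merged:

 // Two hypothetical datasets with overlapping but different schemas
 import spark.implicits._
 Seq((1, "alice")).toDF("id", "name").write.mode("overwrite").parquet("/tmp/anydir/part1")
 Seq((2, 90.5)).toDF("id", "percentage").write.mode("overwrite").parquet("/tmp/anydir/part2")

 val merged = spark.read.option("mergeSchema", true).format("parquet").load("/tmp/anydir/*")
 merged.show()   // columns id, name, percentage; values missing from a file's schema show as null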

Step 3.3 : Passing a schema

If you don't need all the columns, it is better to read only the required ones. For that, you can pass your own schema:

Use the code below to define the schema.
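A minimal sketch, assuming the [id int, name string, percentage double] layout from Step 3.1; keep only the fields you actually need:

 import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, DoubleType}

 // Hypothetical schema matching the example layout; trim it to the required columns
 val schema = StructType(Seq(
   StructField("id", IntegerType),
   StructField("name", StringType),
   StructField("percentage", DoubleType)
 ))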


 val df = spark.read.schema(schema).parquet("/tmp/sample1")

If you want to see it in action, refer to: https://www.youtube.com/watch?v=xz_MDyKnjqg
