Requirement:
You have a sample dataframe that you want to write into Parquet files using Scala.
Solution:
Step 1: Sample Dataframe
Use the below command:
spark-shell
Note: I am using Spark version 2.3.
To create a sample dataframe, please refer to Create-a-spark-dataframe-from-sample-data.
After following the above post, you will see that the students dataframe has been created. You can use this dataframe to perform operations.
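If you just want something quick to follow along with, here is a minimal sketch of building such a dataframe directly in spark-shell. The schema (id, name, marks) and the values are illustrative assumptions; the linked post may use different columns.

// Illustrative sketch only: build a small students dataframe in spark-shell.
// spark-shell imports spark.implicits._ automatically, so toDF is available.
// The columns (id, name, marks) are assumed for this example.
val students = Seq(
  (1, "Alice", 85),
  (2, "Bob", 72),
  (3, "Charlie", 91)
).toDF("id", "name", "marks")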
Use the below command to see the content of the dataframe:
students.show()
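With the illustrative schema sketched above, the output would look roughly like this (your actual rows will depend on the data you created):

+--+-------+-----+
|id|   name|marks|
+--+-------+-----+
| 1|  Alice|   85|
| 2|    Bob|   72|
| 3|Charlie|   91|
+--+-------+-----+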
Step 2: Write into Parquet
To write the complete dataframe into Parquet format, refer to the below code.
In the below code, “/tmp/sample1” is the name of the directory where all the files will be stored. Make sure that the sample1 directory does not already exist; this path is an HDFS path.
students.write.parquet("/tmp/sample1")
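A couple of optional variations on the basic write, for reference. These use standard DataFrameWriter options: mode("overwrite") replaces an existing target directory instead of failing, and the compression option picks the Parquet codec (Snappy is already the default in Spark 2.3).

// Optional variations on the basic write:
// - mode("overwrite") replaces the target directory instead of failing if it exists
// - the compression option selects the Parquet codec (snappy is the default)
students.write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("/tmp/sample1")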
Step 3: Output files
You can check the files using the below command:
hadoop fs -ls /tmp/sample1/
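An illustrative listing only; the part-file names contain a job-specific UUID, and the user, group, sizes, and dates shown here are placeholders, so yours will differ:

Found 2 items
-rw-r--r--   1 hadoop hadoop          0 2019-06-01 10:15 /tmp/sample1/_SUCCESS
-rw-r--r--   1 hadoop hadoop        923 2019-06-01 10:15 /tmp/sample1/part-00000-....snappy.parquet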
The number of output files depends on the number of partitions of the Spark dataframe. You can control the number of files by changing the partition count using repartition, as shown below.
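For example, to force a single output file, repartition the dataframe to one partition before writing. The path /tmp/sample2 is just an illustrative new directory:

// Repartition to a single partition so the write produces one part file.
// coalesce(1) would also work here and avoids a full shuffle when only
// reducing the number of partitions.
students.repartition(1).write.parquet("/tmp/sample2")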
Wrapping up:
Parquet files are great for saving space. Often, data arrives as CSV files; in that case it is better to load the CSV into a dataframe, write it out in Parquet format, and later delete the CSV files for space optimisation.
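As a rough sketch of that CSV-to-Parquet flow (the input path and the header/inferSchema options are assumptions; adjust them to match your files):

// Hypothetical CSV-to-Parquet conversion.
// /tmp/input.csv is an illustrative path; header and inferSchema are assumptions.
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/input.csv")
csvDf.write.parquet("/tmp/input_parquet")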