Requirement:
You have a sample dataframe that you want to write into Parquet files using Scala.
Solution:
Step 1: Sample Dataframe
Use the below command:
spark-shell
Note: I am using Spark version 2.3.
To create a sample dataframe, please refer to Create-a-spark-dataframe-from-sample-data.
After following the above post, you will see that the students dataframe has been created. You can use this dataframe to perform operations.
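If you just want something quick to follow along with, here is a minimal sketch of building such a dataframe directly in spark-shell. The schema (id, name, marks) and the values are illustrative assumptions; the linked post may use different columns.

// Illustrative sketch only: build a small students dataframe in spark-shell.
// spark-shell imports spark.implicits._ automatically, so toDF is available.
// The columns (id, name, marks) are assumed for this example.
val students = Seq(
  (1, "Alice", 85),
  (2, "Bob", 72),
  (3, "Charlie", 91)
).toDF("id", "name", "marks")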
Use the below command to see the content of the dataframe:
students.show()
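With the illustrative schema sketched above, the output would look roughly like this (your actual rows will depend on the data you created):

+--+-------+-----+
|id|   name|marks|
+--+-------+-----+
| 1|  Alice|   85|
| 2|    Bob|   72|
| 3|Charlie|   91|
+--+-------+-----+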
Step 2: Write into Parquet
To write the complete dataframe into Parquet format, refer to the below code.
In the below code, “/tmp/sample1” is the name of the directory where all the files will be stored. Make sure that the sample1 directory does not already exist; this path is an HDFS path.
students.write.parquet("/tmp/sample1")
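A couple of optional variations on the basic write, for reference. These use standard DataFrameWriter options: mode("overwrite") replaces an existing target directory instead of failing, and the compression option picks the Parquet codec (Snappy is already the default in Spark 2.3).

// Optional variations on the basic write:
// - mode("overwrite") replaces the target directory instead of failing if it exists
// - the compression option selects the Parquet codec (snappy is the default)
students.write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("/tmp/sample1")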
Step 3: Output files
You can check the files using the below command:
hadoop fs -ls /tmp/sample1/
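An illustrative listing only; the part-file names contain a job-specific UUID, and the user, group, sizes, and dates shown here are placeholders, so yours will differ:

Found 2 items
-rw-r--r--   1 hadoop hadoop          0 2019-06-01 10:15 /tmp/sample1/_SUCCESS
-rw-r--r--   1 hadoop hadoop        923 2019-06-01 10:15 /tmp/sample1/part-00000-....snappy.parquet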
The number of output files depends on the number of partitions of the Spark dataframe. You can control the number of files by changing the partition count using repartition, as shown below.
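For example, to force a single output file, repartition the dataframe to one partition before writing. The path /tmp/sample2 is just an illustrative new directory:

// Repartition to a single partition so the write produces one part file.
// coalesce(1) would also work here and avoids a full shuffle when only
// reducing the number of partitions.
students.repartition(1).write.parquet("/tmp/sample2")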
Wrapping up:
Parquet files are great for saving space. Often, data arrives as CSV files; in that case it is better to load the CSV into a dataframe, write it out in Parquet format, and later delete the CSV files for space optimisation.
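As a rough sketch of that CSV-to-Parquet flow (the input path and the header/inferSchema options are assumptions; adjust them to match your files):

// Hypothetical CSV-to-Parquet conversion.
// /tmp/input.csv is an illustrative path; header and inferSchema are assumptions.
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/input.csv")
csvDf.write.parquet("/tmp/input_parquet")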