Read Options in Spark

Requirement

CSV is one of the most common file formats, but real-world files are rarely plain comma-separated values. A value may contain the separator itself, quote characters, or line breaks. Spark provides read options that handle these cases while the data is being loaded.

Solution

Option	Description	Default	When to set
header	Treats the first line of the file as column names	false	Set to true to use the first line as column names. Without inferSchema or an explicit schema, every column is read as String.
inferSchema	Infers each column's data type from the data	false	Set to true to have Spark detect column types. This requires an extra pass over the data.
sep	Column separator character	,	Set to any other character, such as a tab or a pipe, when the file is not comma-separated.
quote	Quote character	"	Set to any other character. Separator characters inside a quoted value are treated as part of the value.
escape	Character used to escape a quote inside a quoted value	\	Set to any other character.
multiLine	Whether a single record can span multiple lines	false	Set to true to load files in which quoted values contain line breaks.
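
To see how header and inferSchema interact, the following minimal sketch reads the same small file twice and compares the resulting column types. The object name, the sample data, and the temporary file are hypothetical and exist only for illustration. It assumes Spark is on the classpath and can run in local mode.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object HeaderVsInferSchema {

  // Writes a small, made-up CSV file to a temp location and returns its file URI.
  def writeSample(): String = {
    val path = Files.createTempFile("emp_sample", ".csv")
    val content = "id,name,salary\n1,Ann,1000.5\n2,Bob,2000.0\n"
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))
    path.toUri.toString
  }

  // header only: column names come from the first line, every column is a string.
  def readHeaderOnly(spark: SparkSession, uri: String): DataFrame =
    spark.read
      .option("header", "true")
      .csv(uri)

  // header + inferSchema: Spark scans the data to pick a type for each column.
  def readInferred(spark: SparkSession, uri: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(uri)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("header-vs-inferSchema")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    val uri = writeSample()
    readHeaderOnly(spark, uri).printSchema() // all columns are strings
    readInferred(spark, uri).printSchema()   // id and salary become numeric

    spark.stop()
  }
}
```

If the column types are known in advance, passing an explicit schema avoids the extra pass over the data that inferSchema needs.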

Example:

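 // Read a CSV file whose quoted values contain line breaks.
 val empDFWithNewLine = spark.read
   .option("header", "true")       // first line holds the column names
   .option("inferSchema", "true")  // detect column types from the data
   .option("multiLine", "true")    // allow one record to span several lines
   .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")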
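
The sep, quote, and escape options can be combined in the same way. The sketch below is only an illustration: the object name and the pipe-delimited sample data are made up, and it assumes Spark is on the classpath and can run in local mode. It shows a separator inside a quoted value being kept as part of the value, and an escaped quote being read as a literal quote.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object SepQuoteEscape {

  // Writes a made-up pipe-delimited file that uses single quotes around values.
  def writeSample(): String = {
    val path = Files.createTempFile("emp_pipe", ".csv")
    val content = Seq(
      "id|dept|note",
      "1|'Sales|EMEA'|'plain note'", // the separator appears inside a quoted value
      "2|'Support'|'it\\'s fine'"    // file content: 'it\'s fine' (escaped quote)
    ).mkString("\n")
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))
    path.toUri.toString
  }

  // Reads the file with a custom separator, quote, and escape character.
  def load(spark: SparkSession, uri: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("sep", "|")      // columns are separated by a pipe
      .option("quote", "'")    // values are wrapped in single quotes
      .option("escape", "\\")  // a backslash escapes a quote inside a value
      .csv(uri)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sep-quote-escape")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    load(spark, writeSample()).show(truncate = false)

    spark.stop()
  }
}
```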

Wrapping Up

These options are commonly used when reading CSV files in Spark. They let Spark deal with headers, column types, separators, and multi-line values at load time, before any further processing.
