Read Options in Spark

Requirement

CSV is a very common file format used in many applications. CSV data often comes with complications, such as commas inside values, quoted fields, and values that span multiple lines. Spark provides read options to handle these cases while loading the data.

Solution

| Option | Description | Default Value | Set |
|---|---|---|---|
| header | Whether the first line of the file holds the column names | false | true to use the first line of the file as column names. Unless inferSchema is also set, every column is read as String. |
| inferSchema | Whether to infer column data types automatically from the data | false | true to have Spark infer each column's data type. This requires an extra pass over the data. |
| sep | Column separator character | , | Any other character to use as the separator instead of the comma. |
| quote | Quote character | " | Any other character. Separator characters inside a quoted value are treated as part of the value, not as column boundaries. |
| escape | Escape character | \ | Any other character. |
| multiLine | Whether a record can span multiple lines | false | true to load files in which a quoted value contains line breaks. |
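To see what header and inferSchema change in practice, here is a minimal sketch. It assumes a local SparkSession and a small, made-up sample file written to a temporary location; the object name, method name, and sample rows are hypothetical and not part of the original example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object HeaderAndSchemaSketch {
  // Returns the column type names seen with header only,
  // and then with header + inferSchema.
  def columnTypes(spark: SparkSession): (Seq[String], Seq[String]) = {
    // Hypothetical sample data, written to a temporary file.
    val path = Files.createTempFile("emp_sample", ".csv")
    val data = "id,name,salary\n1,Alice,1000.5\n2,Bob,2000\n"
    Files.write(path, data.getBytes(StandardCharsets.UTF_8))
    val uri = path.toUri.toString

    // header only: the first line supplies the column names,
    // and every column is read as a string.
    val stringsOnly = spark.read
      .option("header", "true")
      .csv(uri)

    // header + inferSchema: Spark scans the data to guess each column's type.
    val typed = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(uri)

    (stringsOnly.schema.fields.map(_.dataType.simpleString).toSeq,
     typed.schema.fields.map(_.dataType.simpleString).toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("header-and-schema-sketch")
      .getOrCreate()

    val (withoutInference, withInference) = columnTypes(spark)
    println(s"header only:          ${withoutInference.mkString(", ")}")
    println(s"header + inferSchema: ${withInference.mkString(", ")}")

    spark.stop()
  }
}
```

Comparing the two printed lines shows why inferSchema is usually paired with header when the column types matter downstream.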

Example:

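 // Read a CSV file whose quoted values may contain line breaks.
 // Assumes an existing SparkSession named spark (as in spark-shell).
 val empDFWithNewLine = spark.read
   .option("header", "true")
   .option("inferSchema", "true")
   .option("multiLine", "true")
   .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")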
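The sep, quote, and escape options can be combined with multiLine in the same way. The following is only a sketch under stated assumptions: the pipe-separated, single-quoted sample data, the temporary file, and the names CsvDialectSketch and readSample are all made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvDialectSketch {
  def readSample(spark: SparkSession): DataFrame = {
    // Hypothetical data: '|' separates columns and a single quote wraps values.
    // Record 1 has a separator and a line break inside quoted values;
    // record 2 has a backslash-escaped quote character inside a quoted value.
    val lines = Seq(
      "id|name|note",
      "1|'first|second'|'line one",
      "line two'",
      "2|'O\\'Brien'|'plain'"
    )
    val path = Files.createTempFile("dialect_sample", ".csv")
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

    spark.read
      .option("header", "true")
      .option("sep", "|")          // column separator instead of the comma
      .option("quote", "'")        // quote character instead of the double quote
      .option("escape", "\\")      // escape character (a single backslash)
      .option("multiLine", "true") // allow line breaks inside quoted values
      .csv(path.toUri.toString)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-dialect-sketch")
      .getOrCreate()

    val df = readSample(spark)
    df.show(false)                        // print full cell contents
    println(s"records: ${df.count()}")

    spark.stop()
  }
}
```

Although the sample file has four physical lines after the header, the quoted line break means they form two records, which is the case multiLine exists to handle.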

Wrapping Up

These options are commonly used when reading CSV files in Spark. They handle the header, schema inference, separator, quoting, and multiline values at load time, before the data is processed any further.
