Requirement
CSV is a very common file format used in many applications. The values in a CSV file are not always simple: they can contain commas, quotes, line breaks, and so on. Spark provides read options to handle these cases when loading the data.
Solution
Option | Description | Default Value | When to Set |
header | Whether the first line of the file holds the column names | false | true, to use the first line of the file as column names. |
inferSchema | Whether to infer column data types automatically from the data | false | true, to have Spark detect each column's data type (this requires an extra pass over the data). |
sep | Column separator character | , | Any other character, when the file does not use commas. |
quote | Quote character | " | Any other character. Separator characters inside a quoted value are ignored. |
escape | Character used to escape quotes inside a quoted value | \ | Any other character. |
multiLine | Whether a single record can span multiple lines | false | true, to load files whose quoted values contain line breaks. |
Example:
val empDFWithNewLine = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")
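The remaining options from the table (sep, quote, escape) can be combined with the ones above in the same way. The following is a minimal sketch, not a definitive implementation: the object name `CsvOptionsSketch`, its helper functions, and the sample rows are all hypothetical and exist only for illustration. It writes a small semicolon-separated file to a temporary location and reads it back, assuming Spark is on the classpath and a local session is acceptable.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvOptionsSketch {

  // Writes hypothetical sample data to a temp file and returns its URI.
  // The first data row has a quoted value that contains both the
  // separator (;) and a line break.
  def writeSample(): String = {
    val sample =
      "id;name;address\n" +
      "1;Asha;\"12 Park Road;\nPune\"\n" +
      "2;Ravi;\"7 Lake View\"\n"
    val path = Files.createTempFile("emp_sample", ".csv")
    Files.write(path, sample.getBytes(StandardCharsets.UTF_8))
    path.toUri.toString
  }

  // Reads the file using every option listed in the table.
  def readSample(spark: SparkSession, uri: String): DataFrame =
    spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // let Spark detect column types
      .option("sep", ";")            // semicolon instead of the default comma
      .option("quote", "\"")         // default quote character, set explicitly
      .option("escape", "\\")        // default escape character, set explicitly
      .option("multiLine", "true")   // allow quoted values to span lines
      .csv(uri)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-options-sketch")
      .master("local[*]")
      .getOrCreate()

    val df = readSample(spark, writeSample())
    df.printSchema()
    df.show(truncate = false)

    spark.stop()
  }
}
```

With multiLine enabled, the quoted address in the first data row should be read as a single value even though it contains a separator and a line break, so the resulting DataFrame is expected to have two rows rather than three.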
Wrapping Up
These options are commonly used when reading CSV files in Spark. They let Spark handle headers, schema inference, separators, multiline values, and similar details while loading the data, before any further processing.