Read CSV File With New Line in Spark

Requirement

CSV files are a very common data source, and issues can occur while processing them for multiple reasons. In this post, we discuss one of them: the NEW LINE character. First we will look at the data, then at the problem it causes, and finally at the solution.

Sample Data

Here, we have six employee records. But if you look closely at the location field, some of its values span multiple lines. If you open this CSV in a plain-text editor, you can see it clearly:

 empno,ename,designation,manager,hire_date,sal,deptno,location
9369,SMITH,CLERK,7902,12/17/1980,800,20,"1A
BANGALORE"
9499,ALLEN,SALESMAN,7698,2/20/1981,1600,30,"2B
HYDERABAD"
9521,WARD,SALESMAN,7698,2/22/1981,1250,30,PUNE
9566,TURNER,MANAGER,7839,04/02/81,2975,20,MUMBAI
9654,MARTIN,SALESMAN,7698,9/28/1981,1250,30,CHENNAI
9369,SMITH,CLERK,7902,12/17/1980,800,20,"5E
KOLKATA"

Although the file contains only six logical records plus a header, it shows 10 physical lines, because three of the location values span two lines each.
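This line-versus-record mismatch can be checked outside Spark. By CSV convention (RFC 4180), a newline inside double quotes is part of the field value, not a record separator. Below is a minimal pure-Scala sketch of a quote-aware record counter; the object and function names are illustrative, not from any library:

```scala
// Count logical CSV records in text where quoted fields may contain newlines.
// A newline only ends a record when we are OUTSIDE double quotes (RFC 4180).
object CsvRecordCount {
  def countRecords(csv: String): Int = {
    var inQuotes = false
    var records  = 0
    var sawChar  = false  // true if the current record has any content
    for (c <- csv) c match {
      case '"'  => inQuotes = !inQuotes; sawChar = true
      case '\n' => if (!inQuotes) { records += 1; sawChar = false } else sawChar = true
      case _    => sawChar = true
    }
    if (sawChar) records += 1  // last record may lack a trailing newline
    records
  }

  def main(args: Array[String]): Unit = {
    val sample =
      """empno,ename,location
        |9369,SMITH,"1A
        |BANGALORE"
        |9499,ALLEN,"2B
        |HYDERABAD"
        |9521,WARD,PUNE""".stripMargin

    println(s"physical lines:  ${sample.linesIterator.size}") // 6
    println(s"logical records: ${countRecords(sample)}")      // 4 (header + 3 rows)
  }
}
```

A naive line-by-line reader counts every physical line as a record; a quote-aware reader recovers the intended records, which is exactly what Spark's multiLine option enables.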

Solution

Step 1: Read CSV in Spark

 val empDF = spark.read.option("header", "true")
                       .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")

If you display this data frame with empDF.show(), you will see that Spark has split the multiline records: a new row appears wherever a new line character occurs in the value of the location column.

Step 2: Fix the New Line Character while loading the CSV

 val empDFWithNewLine = spark.read.option("header", "true")
                                  .option("multiLine", "true")
                                  .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")
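Even after loading with multiLine, the affected location values still contain the embedded newline. If downstream systems expect single-line values, you can normalize them, for example with regexp_replace from org.apache.spark.sql.functions. A pure-Scala sketch of that normalization (the object name is illustrative):

```scala
object NormalizeNewlines {
  // Replace embedded line breaks in a field value with a single space,
  // mirroring what regexp_replace(col("location"), "\\r?\\n", " ")
  // would do to the column in Spark.
  def normalize(field: String): String =
    field.replaceAll("\\r?\\n", " ")

  def main(args: Array[String]): Unit = {
    println(normalize("1A\nBANGALORE")) // 1A BANGALORE
    println(normalize("PUNE"))          // PUNE (unchanged)
  }
}
```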

Wrapping Up

The new line character is a common issue in CSV files. Whenever a quoted field may contain line breaks, include the multiLine option while reading the file.

There are several other options available for reading CSV files in Spark. We will explore them in a coming post.
