Requirement
A CSV file is one of the most common sources of data, and issues can occur while processing it for multiple reasons. In this post we discuss one of them: the newline character. We will first look at the sample data, then at the problem it causes, and finally at the solution.
Sample Data
Here we have six employee records. If you look closely at the location field, some of its values span multiple lines. Opening the CSV in Notepad or another plain text editor makes this clear:

```csv
empno,ename,designation,manager,hire_date,sal,deptno,location
9369,SMITH,CLERK,7902,12/17/1980,800,20,"1A
BANGALORE"
9499,ALLEN,SALESMAN,7698,2/20/1981,1600,30,"2B
HYDERABAD"
9521,WARD,SALESMAN,7698,2/22/1981,1250,30,PUNE
9566,TURNER,MANAGER,7839,04/02/81,2975,20,MUMBAI
9654,MARTIN,SALESMAN,7698,9/28/1981,1250,30,CHENNAI
9369,SMITH,CLERK,7902,12/17/1980,800,20,"5E
KOLKATA"
```

The file shows 10 lines for what are really only six records plus a header, because three of the quoted location values contain an embedded newline.
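To see why the physical line count and the logical record count disagree, here is a minimal, Spark-independent Scala sketch (the helper name `countCsvRecords` is illustrative, not from any library) that counts records by tracking whether a newline falls inside a quoted field:

```scala
// Count logical CSV records: a newline only ends a record
// when it occurs outside a double-quoted field.
def countCsvRecords(csv: String): Int = {
  var inQuotes = false
  var records = 1
  for (ch <- csv) ch match {
    case '"'               => inQuotes = !inQuotes
    case '\n' if !inQuotes => records += 1
    case _                 =>
  }
  records
}

// A tiny sample in the same shape as the file above:
// SMITH's location contains an embedded newline.
val sample =
  "empno,ename,location\n" +
  "9369,SMITH,\"1A\nBANGALORE\"\n" +
  "9521,WARD,PUNE"

// A naive split sees 4 physical lines,
// but there are only 3 logical records (header + 2 rows).
println(sample.split("\n").length) // 4
println(countCsvRecords(sample))   // 3
```

This is exactly the distinction a CSV reader must make: a newline inside quotes is data, while a newline outside quotes is a record separator.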
Solution
Step 1: Read CSV in Spark
```scala
val empDF = spark.read
  .option("header", "true")
  .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")
```
With this default read, the DataFrame splits each record wherever a newline character occurs in the value of the location column, creating extra, broken rows.
Step 2: Fix the New Line Character while loading the CSV
```scala
val empDFWithNewLine = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")
```

The `multiLine` option tells Spark's CSV reader that a quoted value may span multiple lines, so the embedded newlines no longer break records apart.
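As a quick sanity check, comparing the two reads should show the difference. This is a sketch that assumes the six-record sample file above and the `empDF` and `empDFWithNewLine` DataFrames defined in the previous steps:

```scala
// Without multiLine, each physical line becomes a row,
// so the split location values inflate the row count.
println(s"without multiLine: ${empDF.count()} rows")
println(s"with multiLine:    ${empDFWithNewLine.count()} rows")

// Inspect the repaired column; truncate = false keeps the
// embedded newline visible in the console output.
empDFWithNewLine.select("ename", "location").show(truncate = false)
```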
Wrapping Up
The newline character is a common issue in CSV files. Whenever field values may contain embedded newlines, include the `multiLine` option while reading the CSV file.
Multiple other options are available for CSV files in Spark. We will explore them in a coming post.
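As a brief preview, a few commonly used options of Spark's CSV reader are sketched below. All of the option names shown are real `DataFrameReader` CSV options; the values are just the defaults or the ones used in this post:

```scala
val df = spark.read
  .option("header", "true")      // treat the first line as the header row
  .option("multiLine", "true")   // allow quoted values to span multiple lines
  .option("inferSchema", "true") // infer column types instead of reading all strings
  .option("sep", ",")            // field delimiter (default: ,)
  .option("quote", "\"")         // quote character (default: ")
  .option("escape", "\\")        // escape character inside quoted values
  .csv("file:///Users/dipak_shaw/bdp/data/emp_data_with_newline.csv")
```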