Requirement
Let’s say we have a data file with a .tsv extension. A TSV (tab-separated values) file is structured like a CSV file, except for the character that separates the fields.
What is the difference between CSV and TSV?
The difference is the character that separates the fields in the file: a CSV file stores data separated by commas (“,”), whereas a TSV file stores data separated by tabs.
In this post, we will load a TSV file into a Spark DataFrame.
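To make the difference concrete, here is a minimal Scala sketch that splits the same record stored both ways. The values are borrowed from the sample data and the variable names are made up for illustration. A plain `split` ignores quoting, so this only illustrates the separator and is not a replacement for a real parser. It can be pasted into spark-shell or the Scala REPL.

```scala
// The same record stored in the two formats (illustrative values).
val csvLine = "9521,WARD,SALESMAN,7698"
val tsvLine = "9521\tWARD\tSALESMAN\t7698"

// Only the separator passed to split differs.
val csvFields = csvLine.split(",").toList
val tsvFields = tsvLine.split("\t").toList

// Both lines yield the same four fields.
println(csvFields == tsvFields)
```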
Sample Data
Let’s take some dummy data for the exercise:
empno  ename   designation  manager  hire_date   sal   deptno  location
9369   SMITH   CLERK        7902     12/17/1980  800   20      "1A BANGALORE"
9499   ALLEN   SALESMAN     7698     2/20/1981   1600  30      "2B HYDERABAD"
9521   WARD    SALESMAN     7698     2/22/1981   1250  30      PUNE
9566   TURNER  MANAGER      7839     04/02/81    2975  20      MUMBAI
9654   MARTIN  SALESMAN     7698     9/28/1981   1250  30      CHENNAI
9369   SMITH   CLERK        7902     12/17/1980  800   20      "5E KOLKATA"
Solution
We have a TSV file that contains the above data as tab-separated values. The file is available at the path used in the code snippet.
The code snippet below loads the TSV file into a Spark DataFrame.
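val df1 = spark.read
  .option("header", "true")                  // the first line holds the column names
  .option("sep", "\t")                       // fields are separated by tabs
  .option("multiLine", "true")               // a quoted value may span several lines
  .option("quote", "\"")                     // character that wraps quoted values
  .option("escape", "\"")                    // character that escapes quotes inside a quoted value
  .option("ignoreTrailingWhiteSpace", true)  // trim trailing whitespace from values
  .csv("/Users/dipak_shaw/bdp/data/emp_data1.tsv")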
Here, we have used options such as header, sep, and multiLine. We have already covered these options in detail in another post.
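To check what the reader produces, the following sketch writes a cut-down version of the sample data to a temporary file and reads it back with the same header, sep, and quote options. The temporary file, the reduced column set, and the names `sample`, `path`, and `df` are assumptions made for illustration and are not part of the original setup. It assumes Spark is on the classpath, for example in spark-shell, where `getOrCreate()` reuses the existing session.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the running session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("tsv-demo").getOrCreate()

// A reduced, tab-separated copy of the sample data (illustrative only).
val sample = Seq(
  "empno\tename\tlocation",
  "9369\tSMITH\t\"1A BANGALORE\"",
  "9521\tWARD\tPUNE"
).mkString("\n")

// Write the sample to a temporary .tsv file on the local file system.
val path = Files.createTempFile("emp_sample", ".tsv")
Files.write(path, sample.getBytes(StandardCharsets.UTF_8))

// Same reader call as in the post, pointed at the temporary file.
val df = spark.read
  .option("header", "true")
  .option("sep", "\t")
  .option("quote", "\"")
  .csv(path.toUri.toString)

df.printSchema() // without inferSchema, every column is read as a string
df.show(false)   // the surrounding quotes are removed from the location values
```

Calling `printSchema()` and `show()` right after loading is a quick way to confirm that the header was picked up and that the tab separator split each line into the expected columns.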
Wrapping Up
As you may have noticed, we use Spark's built-in CSV reader to read the data from the TSV file and load it into a DataFrame. The only change is the separator character, which is set through the sep option.
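As a rough check of that point, the sketch below stores the same rows once with commas and once with tabs, reads both files through the same CSV reader, and compares the results. The helpers `writeTemp` and `readDelimited`, the temporary files, and the reduced rows are hypothetical and exist only for this illustration. It assumes Spark is on the classpath, as in spark-shell.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Reuses the running session in spark-shell; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("csv-vs-tsv").getOrCreate()

// Hypothetical helper: writes the rows to a temp file using the given separator.
def writeTemp(rows: Seq[Seq[String]], sep: String, suffix: String): String = {
  val path = Files.createTempFile("emp_demo", suffix)
  val text = rows.map(_.mkString(sep)).mkString("\n")
  Files.write(path, text.getBytes(StandardCharsets.UTF_8))
  path.toUri.toString
}

// Hypothetical helper: the same reader call, parameterised only by the separator.
def readDelimited(path: String, sep: String): DataFrame = {
  spark.read.option("header", "true").option("sep", sep).csv(path)
}

// Header plus two records taken from the sample data (reduced columns).
val rows = Seq(
  Seq("empno", "ename", "deptno"),
  Seq("9521", "WARD", "30"),
  Seq("9566", "TURNER", "20")
)

val fromCsv = readDelimited(writeTemp(rows, ",", ".csv"), ",")
val fromTsv = readDelimited(writeTemp(rows, "\t", ".tsv"), "\t")

// If both reads agree, neither DataFrame has rows the other lacks.
val same = fromCsv.except(fromTsv).count() == 0 && fromTsv.except(fromCsv).count() == 0
println(same)
```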