Requirement
Suppose you have received a data set in CSV format and want to analyze it with Spark in Databricks. In this post, we will see how to import data into DBFS (Databricks File System) from a local system for analysis.
Solution
Let’s take the data file which we will import into DBFS. The file contains the following data:
empno | ename  | designation | manager | hire_date  | sal  | deptno | location
9369  | SMITH  | CLERK       | 7902    | 12/17/1980 | 800  | 20     | BANGALORE
9499  | ALLEN  | SALESMAN    | 7698    | 2/20/1981  | 1600 | 30     | HYDERABAD
9521  | WARD   | SALESMAN    | 7698    | 2/22/1981  | 1250 | 30     | PUNE
9566  | TURNER | MANAGER     | 7839    | 4/2/1981   | 2975 | 20     | MUMBAI
9654  | MARTIN | SALESMAN    | 7698    | 9/28/1981  | 1250 | 30     | CHENNAI
9369  | SMITH  | CLERK       | 7902    | 12/17/1980 | 800  | 20     | KOLKATA
Step 1: Go to Databricks URL
Once you visit the home page of the Databricks workspace, you will see several options such as Explore, Import & Export Data, and Create Notebook.
Choose the Import & Export Data option. Its description sums up the focus: quickly import data, preview its schema, create a table, and query it, which is exactly what we want.
Once you have uploaded the file, you can optionally choose a target directory in DBFS. If you do not choose a directory, the file goes to the default directory.
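If you want to double-check where the file landed, a quick notebook cell like the sketch below will do. The file name emp_data.csv is just a placeholder for whatever you uploaded, and /FileStore/tables is the usual default upload directory:

```python
# A minimal sketch to confirm the upload, assuming it landed in the default
# /FileStore/tables directory under the (hypothetical) name emp_data.csv;
# dbutils and display() are available automatically in a Databricks notebook.
display(dbutils.fs.ls("/FileStore/tables"))

# Peek at the beginning of the file to verify the contents.
print(dbutils.fs.head("/FileStore/tables/emp_data.csv"))
```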
The next step is to create a table. Here you will see two options:
Option 1: Create Table with UI
This is a user-friendly, code-free way to create the table.
Once you click this option, it asks you to select a cluster from the drop-down. Here, I have selected the cluster and clicked Preview Table.
Once you click Preview Table, you will see the table data as shown in the screenshot below.
Notice that the column names appear as _c0, _c1, and so on, with the default data type String. You can select the appropriate data type for each column from the drop-down.
From the left bar, there are several options you can update for the table:
Table Name: By default, it shows as the file name. You can change the table name.
Create in Database: You can select the database name under which you want to create the table.
File Type: Here, you can select the file format of the uploaded file. The default is CSV, which matches our data set. The other supported formats are JSON and Avro.
Column Delimiter: The field separator. By default it shows a comma (',').
First Row is Header: Select this option if you want to use the first row’s values as the column headers.
Infer Schema: Select this option to parse the entire file to infer the schema.
Multi-line: This option allows line breaks within a cell.
In our case, I am selecting First Row is Header so that the column headers come from the first row.
You can see that the column names have been updated with the first row’s values.
Once these updates are complete, click Create Table.
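For reference, the settings above map directly to Spark’s CSV reader options. The snippet below is only a rough equivalent of what the UI does, assuming a hypothetical upload path of /FileStore/tables/emp_data.csv:

```python
# A rough PySpark equivalent of the UI settings above, just to show how they
# map to Spark's CSV reader options; the path /FileStore/tables/emp_data.csv
# is a hypothetical upload location, not something taken from the UI itself.
df = (spark.read
      .option("header", "true")       # First Row is Header
      .option("inferSchema", "true")  # Infer Schema
      .option("sep", ",")             # Column Delimiter
      .option("multiLine", "true")    # Multi-line
      .csv("/FileStore/tables/emp_data.csv"))

df.printSchema()   # confirm the inferred column names and data types
display(df)        # preview the data, similar to the UI's Preview Table
```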
Option 2: Create Table in Notebook
When you click the second option, Create Table in Notebook, a Python notebook opens with all the code ready; you just have to execute it. The code is concise and well documented, which makes it easy to customize.
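The exact code depends on your file and the options you picked, but the generated notebook typically follows a pattern along these lines (the path, view, and table names below are placeholders, not the real generated output):

```python
# A hedged sketch of the kind of code the generated notebook contains; treat
# emp_data.csv, emp_data_csv, and emp as placeholders for your own names.
file_location = "/FileStore/tables/emp_data.csv"
file_type = "csv"

df = (spark.read.format(file_type)
      .option("header", "true")
      .option("inferSchema", "true")
      .load(file_location))

# The notebook first exposes the data as a temporary view for ad-hoc SQL ...
df.createOrReplaceTempView("emp_data_csv")

# ... and then shows how to persist it as a permanent table if you want it
# to survive beyond the current session.
df.write.format("parquet").saveAsTable("emp")
```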
Step 2: Table and Data Validation
Now, go to the Data section and choose the database; you will see the new table there.
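If you prefer to validate from a notebook instead, a couple of quick queries will confirm that the table exists and holds the expected data; default.emp below is a placeholder for the database and table name you chose:

```python
# A quick validation sketch, assuming the table was created as default.emp;
# replace the database and table name with whatever you chose in the UI.
spark.sql("DESCRIBE TABLE default.emp").show(truncate=False)
display(spark.sql("SELECT * FROM default.emp LIMIT 10"))
```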
Wrapping Up
In this post, we have learned how to import data into DBFS from a local system and create a table very quickly. This makes the data readily available for analysis.