Requirement
Suppose you have received a data set in CSV format and want to analyze it with Spark in Databricks. In this post, we will see how to import data into DBFS (Databricks File System) from a local system for analysis.
Solution
Let’s take the data file which we will import into DBFS. The file contains the following data:
empno | ename  | designation | manager | hire_date  | sal  | deptno | location
9369  | SMITH  | CLERK       | 7902    | 12/17/1980 | 800  | 20     | BANGALORE
9499  | ALLEN  | SALESMAN    | 7698    | 2/20/1981  | 1600 | 30     | HYDERABAD
9521  | WARD   | SALESMAN    | 7698    | 2/22/1981  | 1250 | 30     | PUNE
9566  | TURNER | MANAGER     | 7839    | 4/2/1981   | 2975 | 20     | MUMBAI
9654  | MARTIN | SALESMAN    | 7698    | 9/28/1981  | 1250 | 30     | CHENNAI
9369  | SMITH  | CLERK       | 7902    | 12/17/1980 | 800  | 20     | KOLKATA
Step 1: Go to Databricks URL
Once you visit the home page of the Databricks workspace, you will see several options such as Explore, Import & Export Data, and Create Notebook.
Choose the Import & Export Data option. Its description sums up the focus: quickly import data, preview its schema, create a table, and query it, which is exactly what we want.
Once you have uploaded the file, you can optionally choose a target directory in DBFS. If you do not choose a directory, the file goes to the default directory.
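If you want to double-check where the file landed, a quick notebook cell like the sketch below will do. The file name emp_data.csv is just a placeholder for whatever you uploaded, and /FileStore/tables is the usual default upload directory:

```python
# A minimal sketch to confirm the upload, assuming it landed in the default
# /FileStore/tables directory under the (hypothetical) name emp_data.csv;
# dbutils and display() are available automatically in a Databricks notebook.
display(dbutils.fs.ls("/FileStore/tables"))

# Peek at the beginning of the file to verify the contents.
print(dbutils.fs.head("/FileStore/tables/emp_data.csv"))
```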
The next step is to create a table. Here you will see two options:
Option 1: Create Table with UI
This is a user-friendly, code-free way to create the table.
Once you click this option, it asks you to select a cluster from the drop-down. Here, I have selected the cluster and clicked Preview Table.
Once you click Preview Table, you will see the table data as shown in the screenshot below.
Notice that the column names appear as _c0, _c1, and so on, with the default data type String. You can select the appropriate data type for each column from the drop-down.
From the left bar, there are several options you can update for the table:
Table Name: By default, it shows as the file name. You can change the table name.
Create in Database: You can select the database name under which you want to create the table.
File Type: Here, you can select the file format of the uploaded file. The default is CSV, which matches our data set. The other supported formats are JSON and Avro.
Column Delimiter: The field separator. By default it shows a comma (',').
First Row is Header: Select this option if you want to use the first row’s values as the column headers.
Infer Schema: Select this option to parse the entire file to infer the schema.
Multi-line: This option allows line breaks within a cell.
In our case, I am selecting First Row is Header so that the column headers come from the first row.
You can see that the column names have been updated with the first row’s values.
Once these updates are complete, click Create Table.
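For reference, the settings above map directly to Spark’s CSV reader options. The snippet below is only a rough equivalent of what the UI does, assuming a hypothetical upload path of /FileStore/tables/emp_data.csv:

```python
# A rough PySpark equivalent of the UI settings above, just to show how they
# map to Spark's CSV reader options; the path /FileStore/tables/emp_data.csv
# is a hypothetical upload location, not something taken from the UI itself.
df = (spark.read
      .option("header", "true")       # First Row is Header
      .option("inferSchema", "true")  # Infer Schema
      .option("sep", ",")             # Column Delimiter
      .option("multiLine", "true")    # Multi-line
      .csv("/FileStore/tables/emp_data.csv"))

df.printSchema()   # confirm the inferred column names and data types
display(df)        # preview the data, similar to the UI's Preview Table
```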
Option 2: Create Table in Notebook
When you click the second option, Create Table in Notebook, a Python notebook opens with all the code ready; you just have to execute it. The code is concise and well documented, which makes it easy to customize.
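The exact code depends on your file and the options you picked, but the generated notebook typically follows a pattern along these lines (the path, view, and table names below are placeholders, not the real generated output):

```python
# A hedged sketch of the kind of code the generated notebook contains; treat
# emp_data.csv, emp_data_csv, and emp as placeholders for your own names.
file_location = "/FileStore/tables/emp_data.csv"
file_type = "csv"

df = (spark.read.format(file_type)
      .option("header", "true")
      .option("inferSchema", "true")
      .load(file_location))

# The notebook first exposes the data as a temporary view for ad-hoc SQL ...
df.createOrReplaceTempView("emp_data_csv")

# ... and then shows how to persist it as a permanent table if you want it
# to survive beyond the current session.
df.write.format("parquet").saveAsTable("emp")
```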
Step 2: Table and Data Validation
Now, go to the Data section and choose the database; you will see the new table there.
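If you prefer to validate from a notebook instead, a couple of quick queries will confirm that the table exists and holds the expected data; default.emp below is a placeholder for the database and table name you chose:

```python
# A quick validation sketch, assuming the table was created as default.emp;
# replace the database and table name with whatever you chose in the UI.
spark.sql("DESCRIBE TABLE default.emp").show(truncate=False)
display(spark.sql("SELECT * FROM default.emp LIMIT 10"))
```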
Wrapping Up
In this post, we have learned how to import data into DBFS from a local system and create a table very quickly. This makes the data readily available for analysis.