Query CSV file in Databricks

Requirement

Suppose you have received a data set in CSV format and want to analyze it using Spark in Databricks. In this post, we will see how to import data into DBFS (Databricks File System) from a local system for analysis.

Solution

Let’s take the data file that we will import into DBFS:

emp_data1

The file contains the following data:

empno,ename,designation,manager,hire_date,sal,deptno,location
9369,SMITH,CLERK,7902,12/17/1980,800,20,BANGALORE
9499,ALLEN,SALESMAN,7698,2/20/1981,1600,30,HYDERABAD
9521,WARD,SALESMAN,7698,2/22/1981,1250,30,PUNE
9566,TURNER,MANAGER,7839,4/2/1981,2975,20,MUMBAI
9654,MARTIN,SALESMAN,7698,9/28/1981,1250,30,CHENNAI
9369,SMITH,CLERK,7902,12/17/1980,800,20,KOLKATA

Step 1: Go to Databricks URL

Once you visit the home page of the Databricks workspace, you will see several options like Explore, Import & Export Data, and Create Notebook.

Choose the Import & Export Data option. Its description makes the focus clear: quickly import data, preview its schema, create a table, and query it, which is exactly what we want.

Once you have uploaded the file, you can optionally choose a target directory in DBFS. If you do not choose any directory, the file is placed in the default directory.
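Before creating the table, you can confirm the upload from a notebook cell. Below is a minimal sketch, assuming the file landed in /FileStore/tables/ (the default target directory for UI uploads); adjust the path if you chose a different directory.

# List the DBFS upload area to confirm the file is there.
# The path is an assumption based on the default upload location.
display(dbutils.fs.ls("/FileStore/tables/"))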

Next, we create a table. Here, you will see two options:

Option 1: Create Table with UI

This is a user-friendly, code-free way to create the table.

Once you click this option, you are asked to select a cluster from the drop-down. Here, I have selected the cluster and clicked Preview Table.

Once you click Preview Table, you will see the table data as shown in the screenshot below.

Notice that the table attributes appear as _c0, _c1, etc., with the default data type String. You can select the appropriate data type for each column from the drop-down.

The left bar provides several options you can update for the table:

Table Name: By default, it is set to the file name. You can change the table name.

Create in Database: You can select the database name under which you want to create the table.

File Type: Here, you can select the file format of the uploaded file. It defaults to CSV, which matches our data set. The other supported formats are JSON and Avro.

Column Delimiter: The character that separates fields. It defaults to a comma (‘,’).

First Row is Header: Select this option if you want to use the values in the first row as column headers.

Infer Schema: Select this option to parse the entire file to infer the schema.

Multi-line: This option allows line breaks within a cell.

In our case, I am selecting First Row is Header to keep the header row.

You can see that the attribute names are now updated with the values from the first row.

Once all these updates are complete, click Create Table.
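For reference, the Create Table UI settings above map to a plain Spark SQL statement that you could run yourself from a Python notebook cell. The sketch below is only illustrative; the table name and file path are assumptions based on the default upload location, so adjust them to match your setup.

# A rough SQL equivalent of what the Create Table UI does.
# Table name and path are assumptions; use your own values.
spark.sql("""
CREATE TABLE IF NOT EXISTS emp_data1
USING CSV
OPTIONS (
  path '/FileStore/tables/emp_data1.csv',
  header 'true',
  inferSchema 'true',
  delimiter ','
)
""")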

Option 2: Create table in Notebook

When you click the second option, Create Table in Notebook, a Python notebook opens with all the code ready; you just have to execute it. The code is handy and well documented, which makes it easy to customize.
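To give an idea of what that notebook contains, here is a minimal sketch of the typical generated code. The file path and view name are assumptions (the generated notebook fills in the actual path of your uploaded file), and the options mirror the choices we made in the UI.

# File location and type (the path is an assumption; yours may differ)
file_location = "/FileStore/tables/emp_data1.csv"
file_type = "csv"

# CSV options, mirroring the choices made in the UI
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# Read the uploaded file into a Spark DataFrame
df = (spark.read.format(file_type)
      .option("inferSchema", infer_schema)
      .option("header", first_row_is_header)
      .option("sep", delimiter)
      .load(file_location))

display(df)

# Expose the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("emp_data1_csv")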

Step 2: Table and Data Validation

Now, go to the Data section and choose the database. You will see the table there.
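You can also validate the data with a quick query from a notebook. A small sketch, assuming the table was created as emp_data1 in the default database (use whatever names you chose in the Create Table UI):

# Preview the first few rows of the new table
display(spark.sql("SELECT * FROM default.emp_data1 LIMIT 10"))

# A simple sanity check, e.g. employees in department 20
display(spark.sql("SELECT ename, sal, location FROM default.emp_data1 WHERE deptno = 20"))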

Wrapping Up

In this post, we have learned how to import data into DBFS from a local system and create a table very quickly. This makes the data readily available for analysis.
