Create Databricks Cluster in Azure

Requirement

Do you want to explore Spark? Azure provides a cloud service named Azure Databricks, which is built on top of Apache Spark. In this post, we are going to create a Databricks cluster in Azure.

Solution

Follow the steps below to create a Databricks cluster in Azure.

Step 1: Login to Azure Portal

Go to portal.azure.com and log in with your credentials.

Step 2: Search for Databricks

Search for databricks and click on Azure Databricks. It will take you to the Azure Databricks service page.

Currently, we don’t have any Databricks service. You can create one by clicking either +Add or the Create azure databricks service option.

Step 3: Create Databricks service in Azure

Part I: Basics

Under Basics, choose the subscription and resource group (if one is not available, create a new one).

Now, provide a workspace name, and choose the location and pricing tier. You will get the below options for the pricing tier.

Pricing Tier

  • Standard (Apache Spark, secure with Azure AD)
  • Premium (+ Role-based access control)
  • Trial (Premium, 14 days)

Part II: Networking

You can choose whether or not to deploy the workspace in a virtual network (VNet).

Part III: Tags

Under Tags, you can provide any name and a value against it.

Part IV: Review + Create

Now, if all looks fine, click on Create.

It will take a few minutes to complete. Once the deployment is complete, click on Go to resource.

On the Databricks home page, you will see the highlighted sections. They provide easy ways to get started: explore a tutorial, import/export any existing Databricks script, create a new notebook, read the documentation, and review common tasks.

Step 4: Create Databricks cluster

Let’s create a new cluster on the Azure Databricks platform. Here, we will set up the configuration.

Go to Clusters from the left bar.

Currently, we don’t have any existing cluster. Let’s create a new one.

Below is the configuration for the cluster setup. This is the least expensive cluster configuration.

Configuration (value/version):

  • Cluster Name: Any name
  • Cluster Mode: Standard
  • Pool: None
  • Databricks Runtime Version: 5.5 LTS (Scala 2.11, Spark 2.4.3)
  • Python Version: 3
  • Autopilot Option: Disabled autoscaling; terminate cluster after 30 mins of inactivity
  • Worker Node: Standard_DS3_v2 x 2 [14.0 GB Memory, 4 Cores, 0.75 DBU]
  • Driver Node: Same as worker type

Advanced Option

Spark: We can set any Spark configuration here for performance tuning. Currently, I left it blank. For Python, it will show the default location.
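
As an aside, the same kind of Spark property can also be set at runtime from a notebook cell once the cluster is up. A minimal sketch, where spark.sql.shuffle.partitions is just an illustrative property and spark is the SparkSession that Databricks notebooks provide:

    // Illustrative: lower the shuffle partition count for small datasets.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    // Read it back to confirm the setting took effect.
    println(spark.conf.get("spark.sql.shuffle.partitions"))

Note that values entered in the cluster’s Spark config box apply cluster-wide at startup, while spark.conf.set only affects the current session.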

Tags: Here, by default, you will see the below tags. The cluster ID will get generated after the cluster is created. In addition to the default tags, you can also add new tags.

Logging & Init: Didn’t make any changes in these 2 parts.

Once you click on Create, it will take a few minutes for the cluster to get created. It will show as Running after completion.
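
Once it is running, a quick sanity check from a notebook (created in Step 5 below) can confirm the runtime and the compute you configured. This is just an illustrative sketch; spark and sc are the SparkSession and SparkContext that Databricks notebooks provide by default:

    // Confirm the Spark version bundled with the 5.5 LTS runtime.
    println(spark.version)           // expected: 2.4.3

    // Total cores across workers: 2 x Standard_DS3_v2 = 2 x 4 cores = 8.
    println(sc.defaultParallelism)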

Step 5: Create Notebook

Go to the workspace from the left bar. You will see two options: Users and Shared.

Users – a private workspace for the individual user.

Shared – a collaborative workspace for the team.

To create a notebook, right-click and choose Notebook.

It prompts you to choose a default language for the notebook. You can pick any of them. This only sets the language the notebook starts in; you can still write cells in other languages in the same notebook, as shown in the sketch below.
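
As a sketch of that mixing, Databricks notebooks support language magic commands on the first line of a cell. Assuming a notebook whose default language is Scala, a %python (or %sql, %r, %md) prefix switches just that one cell:

    // A Scala cell: the notebook's default language, so no magic is needed.
    val greeting = "Hello from Scala"
    println(greeting)

And in a separate cell:

    %python
    # This cell runs as Python, even though the notebook defaults to Scala.
    print("Hello from Python")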

Step 6: Create DataFrame in Notebook
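
A minimal sketch of creating a DataFrame with the Scala API in the new notebook; the sample data and column names are purely illustrative:

    // Implicits enable the .toDF helper on Scala collections.
    // Databricks notebooks import these by default; shown for completeness.
    import spark.implicits._

    // Illustrative sample data: (name, age) pairs.
    val people = Seq(("Alice", 34), ("Bob", 28), ("Cathy", 23))

    // Create a DataFrame with named columns and inspect it.
    val df = people.toDF("name", "age")
    df.printSchema()
    df.show()

In Databricks, display(df) renders the same DataFrame as an interactive table instead of the plain-text output of show().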

Wrapping Up

In this post, we have created a Databricks cluster with a specific configuration. We have also created a notebook and created a DataFrame with the Scala API. We can also use other programming languages in the notebook, like R, Spark SQL, and Python.
