Requirement
Do you want to explore Spark? Azure provides a managed cloud service named Azure Databricks, which is built on top of Apache Spark. In this post, we are going to create a Databricks cluster in Azure.
Solution
Follow the steps below to create a Databricks cluster in Azure.
Step 1: Login to Azure Portal
Go to portal.azure.com and sign in with your credentials.
Step 2: Search for Databricks
Search for "databricks" and click Azure Databricks. It will take you to the service page.
Currently, we don’t have any Databricks workspace. You can create one by clicking either +Add or the Create azure databricks service option.
Step 3: Create the Databricks service in Azure
Part I: Basics
Under Basics, choose a subscription and a resource group (create a new one if none is available).
Now, provide a workspace name and choose the location and pricing tier. The following pricing tiers are available.
Pricing Tier
- Standard (Apache Spark, secure with Azure AD)
- Premium (adds role-based access control)
- Trial (Premium, 14 days)
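If you prefer the command line, the same workspace can be created with the Azure CLI. This is a sketch, not the method used in this post; the resource group, workspace name, and region below are placeholders, and the `databricks` CLI extension must be installed first.

```shell
# Assumes: az login has been done and the databricks extension is installed:
#   az extension add --name databricks

# Placeholder resource group and region -- substitute your own
az group create --name my-databricks-rg --location eastus

# --sku matches the pricing tiers above: standard, premium, or trial
az databricks workspace create \
  --resource-group my-databricks-rg \
  --name my-databricks-ws \
  --location eastus \
  --sku standard
```

The CLI route is handy for scripting repeatable environments, while the portal flow described in this post is easier for a first exploration.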
Part II: Networking
You can choose whether to deploy the workspace in your own virtual network (VNet injection) or not.
Part III: Tags
Under Tags, you can provide name/value pairs to categorize the resource.
Part IV: Review + Create
Review the settings, and if all looks fine, click Create.
The deployment takes a few minutes to complete. Once it is complete, click Go to resource.
On the Databricks home page, you will see the highlighted sections. They provide easy ways to get started: explore a tutorial, import/export an existing Databricks script, create a new notebook, read the documentation, and review common tasks.
Step 4: Create databricks cluster
Let’s create a new cluster on the Azure Databricks platform. Here, we will set up the configuration.
Go to Clusters in the left sidebar.
Currently, we don’t have any existing cluster. Let’s create a new one.
Below is the configuration for the cluster setup. This is close to the least expensive configuration available.
| Configuration | Value/Version |
| --- | --- |
| Cluster Name | Any name |
| Cluster Mode | Standard |
| Pool | None |
| Databricks Runtime Version | 5.5 LTS (Scala 2.11, Spark 2.4.3) |
| Python Version | 3 |
| Autopilot Option | Autoscaling disabled |
| Worker Node | Standard_DS3_v2 x 2 (14.0 GB memory, 4 cores, 0.75 DBU) |
| Driver Node | Same as worker type |
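The same configuration can also be expressed as a request body for the Databricks Clusters API (`clusters/create`), which is useful for recreating the cluster programmatically. This is a sketch under the assumptions of the table above; the cluster name is a placeholder, and `5.5.x-scala2.11` is the runtime label corresponding to 5.5 LTS.

```json
{
  "cluster_name": "my-first-cluster",
  "spark_version": "5.5.x-scala2.11",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "spark_conf": {},
  "custom_tags": {}
}
```

A fixed `num_workers` of 2 mirrors the disabled-autoscaling choice; with autoscaling enabled, you would supply an `autoscale` object with `min_workers` and `max_workers` instead.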
Advanced Options
Spark: We can set any Spark configuration here for performance tuning. I left it blank for now. For Python, it shows the default environment location.
Tags: Here, by default, you will see the default tags. The cluster ID is generated after the cluster is created. In addition to the default tags, you can also add new ones.
Logging & Init Scripts: I didn’t make any changes in these two sections.
Once you click Create, it takes a few minutes for the cluster to be provisioned. It shows as Running on completion.
Step 5: Create Notebook
Go to Workspace in the left sidebar. You will see two options – Users and Shared.
Users – a private workspace for each user.
Shared – a collaborative workspace for the team.
To create a notebook, right-click in the workspace and choose Notebook.
It offers a choice of default language, and you can pick any of them. The notebook is created in the chosen language, but you can still write cells in other languages within the same notebook using magic commands such as %python, %sql, %r, and %scala.
Step 6: Create DataFrame in Notebook
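The notebook cell below is a minimal sketch of creating a DataFrame with the Scala API. The column names and sample rows are invented for illustration; `spark` is the SparkSession that Databricks pre-creates in every notebook.

```scala
// `spark` is the SparkSession provided automatically by Databricks notebooks.
// Sample data: (id, language) tuples -- purely illustrative.
val data = Seq((1, "Scala"), (2, "Python"), (3, "R"))

// createDataFrame infers the schema from the tuple types;
// toDF assigns readable column names.
val df = spark.createDataFrame(data).toDF("id", "language")

// show() prints the DataFrame as a text table; in Databricks,
// display(df) renders it as an interactive table instead.
df.show()
```

Running the cell against the cluster created above should print a three-row table with `id` and `language` columns.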
Wrapping Up
In this post, we created a Databricks cluster in Azure with a specific configuration. We also created a notebook and created a DataFrame with the Scala API. We can also use other programming languages in the notebook, such as R, SQL, and Python.