Requirement
Do you want to explore Spark? Azure provides a cloud service named Azure Databricks, which is built on top of Apache Spark. In this post, we are going to create a Databricks cluster in Azure.
Solution
Follow the steps below to create a Databricks cluster in Azure.
Step 1: Login to Azure Portal
Go to portal.azure.com and log in with your credentials.
Step 2: Search for Databricks
Search for "databricks" and click on Azure Databricks. It will take you to the Azure Databricks service page.
Currently, we don't have any Databricks service. You can create one by clicking either +Add or the Create azure databricks service option.
Step 3: Create Databricks service in Azure
Part I: Basics
Under Basics, choose the subscription and resource group (if one is not available, create a new one).
Now, provide a workspace name and choose the location and pricing tier. You will get the below options for the pricing tier.
Pricing Tier
- Standard (Apache Spark, Secure with Azure AD)
- Premium (+Role based access control)
- Trial (Premium, 14 days)
Part II: Networking
You can choose whether or not you want to deploy the workspace in your own virtual network (VNet).
Part III: Tags
Under Tags, you can provide any name and a value against it to categorize the resource.
Part IV: Review + Create
Now, if all looks fine, click on Create.
It will take a few minutes to complete. Once the deployment is complete, click on Go to resource.
The Databricks home page provides easy ways to get started: explore a quickstart tutorial, import/export existing Databricks scripts, create a new notebook, browse the documentation, and review common tasks.
Step 4: Create Databricks cluster
Let's create a new cluster on the Azure Databricks platform. Here, we will set up the configuration.
Go to Clusters from the left sidebar.
Currently, we don’t have any existing cluster. Let’s create a new one.
Below is the configuration for the cluster setup. This is the least expensive cluster configuration; a quick way to verify these values from a notebook is sketched after the table.
| Configuration | Value/Version |
| --- | --- |
| Cluster Name | Any name |
| Cluster Mode | Standard |
| Pool | None |
| Databricks Runtime Version | 5.5 LTS (Scala 2.11, Spark 2.4.3) |
| Python Version | 3 |
| Autopilot Options | Autoscaling disabled |
| Worker Node | Standard_DS3_v2 x 2 [14.0 GB Memory, 4 Cores, 0.75 DBU] |
| Driver Node | Same as worker type |
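Once the cluster is up and a notebook is attached to it (notebooks are covered in Step 5), a couple of one-liners can confirm the values above. This is just a sanity-check sketch; `spark` and `sc` are the SparkSession and SparkContext that Databricks notebooks predefine:

```scala
// Databricks notebooks predefine `spark` (SparkSession) and `sc` (SparkContext).
println(spark.version)         // expect "2.4.3" on the 5.5 LTS runtime
println(sc.defaultParallelism) // typically 2 workers x 4 cores = 8
```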
Advanced Options
Spark: We can set any Spark configuration here for performance tuning; I left it blank for now. The entries are newline-separated key-value pairs (see the sketch after this list). For Python, it will show the default path.
Tags: By default, you will see a few predefined tags here; the ClusterId tag gets its value only after the cluster is created. In addition to the default tags, you can add new tags as well.
Logging & Init Scripts: I didn't make any changes in these two parts.
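As an illustration of what such a Spark config entry does, here is a minimal sketch of reading and overriding a setting from a notebook. The key `spark.sql.shuffle.partitions` is used purely as an example; the post itself leaves the Spark config box empty:

```scala
// Read a Spark setting; values supplied in the cluster's Spark config box
// surface here (spark.sql.shuffle.partitions is just an example key).
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Override it for this session only; the cluster-level value is untouched.
spark.conf.set("spark.sql.shuffle.partitions", "8")
```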
Once you click on Create Cluster, it will take a few minutes for the cluster to come up. It will show as Running after completion.
Step 5: Create Notebook
Go to Workspace from the left sidebar. You will see two options: Users and Shared.
Users: a private workspace for each user.
Shared: a collaborative workspace for the team.
To create a notebook, right-click and choose Create > Notebook.
It prompts you to pick a default language; you can choose any of them. This only sets the notebook's default; cells can still be written in other languages in the same notebook using language magic commands, as sketched below.
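For instance, in a notebook whose default language is Scala, a cell that starts with a magic command such as `%python`, `%sql`, `%r`, or `%md` runs in that language instead. A small sketch (the cell contents are illustrative):

```scala
// Cell 1: runs in the notebook's default language (Scala), no magic needed.
println("Hello from Scala")

// Cell 2 (a separate notebook cell) can switch languages by starting
// with a magic command, e.g.:
//   %python
//   print("Hello from Python")
```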
Step 6: Create DataFrame in Notebook
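Using the notebook created above, here is a minimal Scala sketch of creating a DataFrame in a cell; the column names and sample rows are made up for illustration:

```scala
// Databricks Scala notebooks predefine `spark` and pre-import its implicits;
// in a standalone Spark application you would need: import spark.implicits._

// Build a small DataFrame from an in-memory sequence (illustrative data).
val people = Seq(
  ("Alice", 34),
  ("Bob", 28),
  ("Carol", 45)
).toDF("name", "age")

people.printSchema() // inferred schema: name (string), age (int)
people.show()

// A simple transformation with the DataFrame API.
people.filter($"age" > 30).show()
```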
Wrapping Up
In this post, we created a Databricks cluster with a specific configuration. We also created a notebook and built a DataFrame with the Scala API. We can also use other languages in the notebook, such as Python, R, and SQL.