Requirement
Let’s say there is a system that extracts data from various sources (databases, REST APIs, etc.) and dumps it into Azure Data Lake Storage Gen2, aka ADLS Gen2. Now we want to access and read these files in Spark for further processing to meet our business requirement. In this post, we are going to read a file from Azure Data Lake Gen2 using PySpark.
Prerequisite
For this post, you will need:
- Azure Data Lake Storage
- Azure Databricks
Solution
To access ADLS Gen2 data in Spark, we need the ADLS Gen2 details, such as the connection string, access key, and storage account name.
There are multiple ways to access ADLS Gen2 files: directly with a shared access key, through Spark configuration, through a mount, through a mount created with a service principal (SPN), and so on. In this post, we are going to use a mount to access the Gen2 Data Lake files in Azure Databricks; a minimal mount sketch follows the list below.
You can refer to the posts below to:
- Create Mount in Azure Databricks
- Create Mount in Azure Databricks using Service Principal & OAuth
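For reference, here is a minimal sketch of mounting an ADLS Gen2 container in Databricks using a storage account access key kept in a Databricks secret scope. The storage account name, scope name, and key name are placeholders you must replace with your own; the service principal and OAuth variant is covered in the second post above.
%python
# Minimal mount sketch (assumes a storage account named <storage-account>
# and an access key stored in a Databricks secret scope)
dbutils.fs.mount(
  source = "abfss://blob-container@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/bdpdatalake",
  extra_configs = {
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net":
      dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")
  }
)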
In our last post, we already created a mount point on Azure Data Lake Gen2 storage. Here, we are going to use that mount point to read a file from Azure Data Lake Gen2 using PySpark.
Sample Files in Azure Data Lake Gen2
For this exercise, we need some sample files with dummy data available in Gen2 Data Lake.
We have three files named emp_data1.csv, emp_data2.csv, and emp_data3.csv under the blob-storage folder, which lives in the container named blob-container.
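The actual values in these files do not matter for this exercise. For illustration only, assume each file has a header row followed by a handful of employee records along these (hypothetical) lines:
empno,ename,designation,manager,hire_date,sal,deptno
7369,SMITH,CLERK,7902,12-17-1980,800,20
7499,ALLEN,SALESMAN,7698,2-20-1981,1600,30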
Python Code to Read a File from Azure Data Lake Gen2
Let’s first check the mount path and see what is available:
%fs ls /mnt/bdpdatalake/blob-storage
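The same listing can also be done from Python with dbutils, which is handy if you want to work with the file list programmatically:
%python
# List the files under the mount point (equivalent to the %fs command above)
display(dbutils.fs.ls("/mnt/bdpdatalake/blob-storage"))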
%python
# Read emp_data1.csv from the mount point into a DataFrame,
# treating the first row as a header
empDf = spark.read.format("csv") \
  .option("header", "true") \
  .load("/mnt/bdpdatalake/blob-storage/emp_data1.csv")
display(empDf)
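Since all three sample files share the same layout, you can also read them in one go with a path glob; this is standard Spark behavior, not something specific to ADLS:
%python
# Read all three employee files at once using a wildcard path
allEmpDf = spark.read.format("csv") \
  .option("header", "true") \
  .load("/mnt/bdpdatalake/blob-storage/emp_data*.csv")
display(allEmpDf)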
Wrapping Up
In this post, we learned how to access and read files from Azure Data Lake Gen2 storage using Spark. Once the data is available in a DataFrame, we can process and analyze it; a couple of quick checks are sketched below.
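For example, here is a minimal sanity check on the DataFrame we loaded above; it makes no assumptions about specific column names:
%python
# Print the inferred schema (all string columns, since inferSchema was not enabled)
empDf.printSchema()
# Count the number of rows read from emp_data1.csv
print(empDf.count())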