Read file from Azure Data Lake Gen2 using Spark

Requirement

Let’s say there is a system that extracts data from various sources (databases, REST APIs, etc.) and dumps it into Azure Data Lake Storage Gen2 (ADLS Gen2). Now we want to access and read those files in Spark for further processing to meet our business requirement. In this post, we are going to read a file from Azure Data Lake Gen2 using Spark.

Prerequisite

For this post, it is required to have:

  • Azure Data Lake Storage
  • Azure Databricks

Solution

In order to access ADLS Gen2 data in Spark, we need the ADLS Gen2 account details, such as the storage account name, access key, or connection string.

There are multiple ways to access ADLS Gen2 files: directly with a shared access key, through Spark configuration, via a mount point, via a mount created with a service principal (SPN), and so on. In this post, we are going to use a mount point to access the Gen2 Data Lake files in Azure Databricks.
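Before turning to the mount, it helps to see the simplest of these options. The cell below is a minimal sketch of the shared-access-key approach; the storage account bdpstorageaccount, the container blob-container, and the secret scope bdp-scope with its key name are all placeholder names, so substitute your own values.

%scala
// Direct access with the storage account key (all names below are placeholders).
// The key is pulled from a Databricks secret scope instead of being hard-coded.
spark.conf.set(
  "fs.azure.account.key.bdpstorageaccount.dfs.core.windows.net",
  dbutils.secrets.get(scope = "bdp-scope", key = "bdp-adls-key")
)

// With the key set, files can be read directly over the abfss:// scheme.
val directDf = spark.read
  .format("csv")
  .option("header", "true")
  .load("abfss://blob-container@bdpstorageaccount.dfs.core.windows.net/blob-storage/emp_data1.csv")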

In our previous post, we already created a mount point on Azure Data Lake Gen2 storage; you can refer to that post for the detailed setup steps. Here, we are going to use that mount point to read a file from Azure Data Lake Gen2 using Spark Scala.
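For reference, creating such a mount with a service principal typically looks like the sketch below. The application ID, tenant ID, secret scope, storage account, and container names are all placeholders for values from your own Azure AD app registration.

%scala
// Mount the container with a service principal (SPN) via OAuth.
// All IDs, names, and the secret scope below are placeholders.
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "bdp-scope", key = "spn-secret"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)

dbutils.fs.mount(
  source = "abfss://blob-container@bdpstorageaccount.dfs.core.windows.net/",
  mountPoint = "/mnt/bdpdatalake",
  extraConfigs = configs
)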

Sample Files in Azure Data Lake Gen2

For this exercise, we need some sample files with dummy data available in the Gen2 Data Lake. We have three files named emp_data1.csv, emp_data2.csv, and emp_data3.csv under the blob-storage folder inside the blob-container container.

Spark Code to Read a file from Azure Data Lake Gen2

Let’s first check the mount path and see what is available:

%fs
ls /mnt/bdpdatalake/blob-storage

%scala
// Read the CSV with a header row into a DataFrame and display it.
val empDf = spark.read
  .format("csv")
  .option("header", "true")
  .load("/mnt/bdpdatalake/blob-storage/emp_data1.csv")

display(empDf)
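Since all three sample files share the same layout, they can also be read in one pass with a wildcard. This is a small sketch on top of the same mount point, with inferSchema enabled so Spark guesses the column types (at the cost of an extra scan over the data).

%scala
// Read every matching CSV under the mounted folder into a single DataFrame.
val allEmpDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/mnt/bdpdatalake/blob-storage/emp_data*.csv")

display(allEmpDf)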

Wrapping Up

In this post, we have learned how to access and read files from Azure Data Lake Gen2 storage using Spark. Once the data is available in a DataFrame, we can process and analyze it for our business requirements.
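For instance, a simple aggregation over the DataFrame might look like the following; the department column here is an assumption, since the actual schema depends on your sample files.

%scala
import org.apache.spark.sql.functions._

// Hypothetical follow-up analysis: count employees per department.
// The "department" column is an assumption; adjust to your actual schema.
val summaryDf = empDf
  .groupBy(col("department"))
  .agg(count("*").as("employee_count"))

display(summaryDf)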
