Requirement
In this post, we will get the last modified date of a file in Spark using the Hadoop FileSystem API. This is useful when processing data based on its last modified timestamp.
Solution
Let's assume your files are available on HDFS or another Hadoop-compatible storage. We will use the Hadoop FileSystem API to read the file's metadata and then extract the file's last modified date.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql._
import spark.implicits._
These imports bring in the Hadoop FileSystem classes and the Spark SQL types used below.
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
This creates a Hadoop FileSystem object from the Spark session's Hadoop configuration.
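If your files live somewhere other than the default file system (for example an S3 or Azure path), you can derive the FileSystem from the path itself. A minimal sketch; the bucket name below is a hypothetical placeholder:

// Hypothetical path on a non-default file system; requires the matching
// connector (e.g., hadoop-aws for s3a) on the classpath.
val dataPath = new Path("s3a://my-bucket/bdp/data")
// Derive the FileSystem that actually serves this path.
val fsForPath = dataPath.getFileSystem(spark.sparkContext.hadoopConfiguration)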
val status = fs.listStatus(new Path("/Users/dipak_shaw/bdp/data/emp_data1.csv"))
This returns the status of the file (or, if the path is a directory, the status of each file in it).
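Each FileStatus exposes the path, type, and timestamps. Note that getModificationTime returns epoch milliseconds; here is a small sketch that prints it in a readable form:

// getModificationTime is in epoch milliseconds; Instant renders it as ISO-8601.
import java.time.Instant

status.foreach { s =>
  println(s"${s.getPath} last modified at ${Instant.ofEpochMilli(s.getModificationTime)}")
}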
As you can see, the value of status is an Array[FileStatus]. Let's convert each entry to a Row using the command below:
val files2 = status.map(x => Row(x.getPath.toString, x.isDirectory, x.getModificationTime, x.getAccessTime))
Once the metadata is available as an Array[Row], we can convert it into a DataFrame using the command below:
val dfFromArray = spark.sparkContext
  .parallelize(files2)
  .map(row => (row.getString(0), row.getBoolean(1), row.getLong(2), row.getLong(3)))
  .toDF("path", "isDirectory", "modificationTime", "accessTime")
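With the DataFrame in place, you can filter files by their modification time. A hypothetical usage example; the cutoff date is an assumption for illustration:

import java.sql.Timestamp
import org.apache.spark.sql.functions._

// modificationTime is in epoch milliseconds; divide by 1000 and cast to get a timestamp column.
val withTs = dfFromArray.withColumn("modifiedAt", (col("modificationTime") / 1000).cast("timestamp"))

// Keep only files modified after the (hypothetical) cutoff.
val recentFiles = withTs.filter(col("modifiedAt") > lit(Timestamp.valueOf("2021-01-01 00:00:00")))
recentFiles.show(false)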
Wrapping Up
A file's metadata provides many useful details. The Hadoop FileSystem API can read the metadata of files stored on HDFS, Azure Blob Storage, and other Hadoop-compatible file systems.
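To make the whole flow reusable, it can be wrapped in a small helper. This is a minimal sketch under the assumptions above; the function name and column names are placeholders, not a standard API:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

// List a path on any Hadoop-compatible file system and return its file metadata as a DataFrame.
def fileMetadata(spark: SparkSession, dir: String): DataFrame = {
  val path = new Path(dir)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val rows = fs.listStatus(path).map { s =>
    Row(s.getPath.toString, s.isDirectory, s.getModificationTime, s.getAccessTime)
  }
  val schema = StructType(Seq(
    StructField("path", StringType),
    StructField("isDirectory", BooleanType),
    StructField("modificationTime", LongType),
    StructField("accessTime", LongType)
  ))
  spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
}

// Example call with the directory used earlier in this post:
val metadataDf = fileMetadata(spark, "/Users/dipak_shaw/bdp/data")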