Requirement
In this post, we are going to get the last modified date of a file in Spark using the Hadoop FileSystem API. This is useful when processing data based on its last modified timestamp.
Solution
Here, let’s assume your files are available in HDFS or any other Hadoop-compatible storage. We will use the Hadoop FileSystem API to read the metadata of the file and then extract the file’s last modified date.
import org.apache.hadoop.fs.{FileAlreadyExistsException, FileSystem, FileUtil, Path}
import org.apache.hadoop.conf.Configuration
import spark.implicits._
import org.apache.spark.sql._
Here, we import all the packages required for the code.
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
Here, we created a Hadoop FileSystem object using the Spark Hadoop configuration.
val status = fs.listStatus(new Path("/Users/dipak_shaw/bdp/data/emp_data1.csv"))
Here, the above command returns the status of the file (or, for a directory path, the status of each file inside it).
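If you only need the timestamp of a single file, a shorter sketch (reusing the fs object created above; the path is the same illustrative one) is to call getFileStatus directly:

```scala
import org.apache.hadoop.fs.Path

// Fetch the status of one file directly, instead of listing.
val singleStatus = fs.getFileStatus(new Path("/Users/dipak_shaw/bdp/data/emp_data1.csv"))

// getModificationTime returns the last modified time as epoch milliseconds.
val modifiedMillis: Long = singleStatus.getModificationTime
```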
As you can see, the output value of status is an Array[FileStatus]. Let’s convert it to an Array of Row using the command below:
val files2 = status.map(x => Row(x.getPath.toString, x.isDirectory, x.getModificationTime, x.getAccessTime))
Once it is available as an Array of Row, we can convert it into a DataFrame using the command below:
val dfFromArray = spark.sparkContext.parallelize(files2).map(row => (row.getString(0), row.getBoolean(1), row.getLong(2), row.getLong(3))).toDF()
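Note that getModificationTime and getAccessTime return epoch milliseconds. A minimal sketch for turning those values into readable timestamps, using only java.time (no Spark needed; the helper name millisToTimestamp is ours, not part of any API):

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Convert epoch milliseconds (as returned by getModificationTime) to a
// human-readable timestamp string in the given time zone.
def millisToTimestamp(millis: Long, zone: String = "UTC"): String = {
  val formatter = DateTimeFormatter
    .ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneId.of(zone))
  formatter.format(Instant.ofEpochMilli(millis))
}
```

You can map this helper over the third and fourth columns of files2 before building the DataFrame if you prefer string timestamps over raw longs.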
Wrapping Up
The metadata of a file offers many useful details. The Hadoop FileSystem API can be used to read the metadata of files stored in HDFS, Azure Blob Storage, and other Hadoop-compatible stores.
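Putting the steps together, here is a minimal end-to-end sketch. It assumes a running SparkSession named spark; the directory path and column names are illustrative, not fixed by any API:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Build a FileSystem object from the Spark Hadoop configuration.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// List the status of every file under an illustrative directory.
val status = fs.listStatus(new Path("/Users/dipak_shaw/bdp/data/"))

// Extract (path, isDirectory, modificationTime, accessTime) per file.
val rows = status.map(x =>
  (x.getPath.toString, x.isDirectory, x.getModificationTime, x.getAccessTime))

// Name the columns so the resulting DataFrame is self-describing.
import spark.implicits._
val df = rows.toSeq.toDF("path", "is_directory", "modified_ms", "accessed_ms")
df.show(false)
```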