Requirement
In this post, we will learn how to get the last element of a list (array) column in a Spark dataframe.
Solution
Create a dataframe with dummy data:
val df = spark.createDataFrame(Seq(
  ("1100", "Person1", "Street1#Location1#City1", null),
  ("1200", "Person2", "Street2#Location2#City2", "Contact2"),
  ("1300", "Person3", "Street3#Location3#City3", null),
  ("1400", "Person4", null, "Contact4"),
  ("1500", "Person5", "Street5#Location5#City5", null)
)).toDF("id", "name", "address", "contact")
Option 1: Get last element using index
import org.apache.spark.sql.functions.{col, split}

val df2 = df.withColumn("address", split(col("address"), "\\#").getItem(2))
Here, I have split the address column on the character “#” and picked the value at index 2 (0-based), which is the last part, i.e. the city. Note that this works only because every address has exactly three parts.
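If the number of parts can vary, a hard-coded index will not always point at the last element. On Spark 2.4 and later, element_at with a negative index is a length-agnostic alternative; the snippet below is just a sketch, not part of the original approach:

import org.apache.spark.sql.functions.{col, element_at, split}

// element_at with index -1 returns the last element of the array,
// no matter how many parts the address contains (Spark 2.4+).
val df2Alt = df.withColumn("address", element_at(split(col("address"), "\\#"), -1))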
Option 2: Get last element using UDF
import org.apache.spark.sql.functions.{col, split, udf}
import scala.util.{Try, Success, Failure}

val get_last = udf((xs: Seq[String]) => Try(xs.last).toOption)

val df3 = df.withColumn("address", get_last(split(col("address"), "\\#")))
Here, I have created a UDF that takes a sequence of strings as its argument and returns the last element using the last function. Wrapping the call in Try(...).toOption makes the UDF return None (stored as null) for null or empty arrays instead of throwing an exception.
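As a quick sanity check on the df3 defined above, rows whose address is null come out as null after the transformation rather than failing:

// Inspect the result; null addresses stay null thanks to Try(...).toOption.
df3.select("id", "address").show(false)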
Option 3: Get last element using SQL
You can create a temp view from the dataframe and run the query below:
df.createOrReplaceTempView("vw_tbl")

val df4 = spark.sql("SELECT reverse(split(address, '#'))[0] FROM vw_tbl")
Here, in the first line, I have created a temp view from the dataframe.
The second line runs a SQL query that splits the address column on “#”, reverses the resulting array, and takes the first value (index 0), which corresponds to the last element of the original array.
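On Spark 2.4 and later, the same result can be written more directly in SQL with element_at and a negative index; this is an alternative sketch, not part of the original query:

// element_at(..., -1) returns the last element of the array (Spark 2.4+).
val df4Alt = spark.sql("SELECT element_at(split(address, '#'), -1) AS address FROM vw_tbl")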
Wrapping Up
In this post, we learned how to get the last element of a collection column in a dataframe using 3 different options: directly using an index, creating a generic UDF, and finally using a SQL query.