Get the last element of a list column in a Spark DataFrame

Requirement

In this post, we will learn how to get the last element of a list column in a Spark DataFrame.

Solution

Create a DataFrame with dummy data:

 val df = spark.createDataFrame(Seq(
   ("1100", "Person1", "Street1#Location1#City1", null),
   ("1200", "Person2", "Street2#Location2#City2", "Contact2"),
   ("1300", "Person3", "Street3#Location3#City3", null),
   ("1400", "Person4", null, "Contact4"),
   ("1500", "Person5", "Street5#Location5#City5", null)
 )).toDF("id", "name", "address", "contact")

Option 1: Get last element using index

 import org.apache.spark.sql.functions._
 val df2 = df.withColumn("address", split(col("address"), "\\#").getItem(2))

Here, I have split the address column on the “#” character and taken the value at index 2, which is the city. Note that this relies on every address having exactly three parts; a fixed index will not return the last element if the number of parts varies.
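
If the number of parts can vary from row to row, a safer option on Spark 2.4 and later is the built-in element_at function, which accepts a negative index that counts from the end of the array. A minimal sketch of that idea (the df2b name is mine):

 import org.apache.spark.sql.functions.{col, element_at, split}
 // -1 counts from the end, so this returns the last part no matter how many
 // "#"-separated values the address contains; a null address stays null.
 val df2b = df.withColumn("address", element_at(split(col("address"), "\\#"), -1))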

Option 2: Get last element using UDF

 import scala.util.Try
 // Try absorbs a null or empty sequence and yields None, which becomes null in the result.
 val get_last = udf((xs: Seq[String]) => Try(xs.last).toOption)

val df3 = df.withColumn("address", get_last(split(col("address"), "\\#")))

Here, I have created a UDF that takes a sequence of Strings and returns its last element using the last function. Wrapping the call in Try means that a null or empty sequence produces None, which shows up as null in the DataFrame instead of throwing an exception.
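
If you want to reuse the same logic from SQL (as in Option 3 below), the function can also be registered by name. A short sketch, assuming the vw_tbl view created in Option 3 (the df3b name is mine):

 import scala.util.Try
 // Register the function so Spark SQL can call it as get_last(...).
 spark.udf.register("get_last", (xs: Seq[String]) => Try(xs.last).toOption)
 val df3b = spark.sql("SELECT get_last(split(address, '#')) AS address FROM vw_tbl")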

Option 3: Get last element using SQL

You can create a temp view from the DataFrame and run the query below:

 df.createOrReplaceTempView("vw_tbl")
 val df4 = spark.sql("SELECT reverse(split(address, '#'))[0] FROM vw_tbl")

Here, in the first line, I have created a temp view from the DataFrame.

In the second line, I executed a SQL query that splits the address column on '#', reverses the resulting array, and takes the first value (index 0), which corresponds to the last element of the original array.
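
Spark 2.4 and later also expose element_at in SQL, which avoids the reverse trick altogether. A sketch against the same view (the df5 name is mine):

 // element_at with -1 reads from the end of the array.
 val df5 = spark.sql("SELECT element_at(split(address, '#'), -1) AS address FROM vw_tbl")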

Wrapping Up

In this post, we learned how to get the last element of a collection value in a DataFrame using three different options: directly using an index, by creating a generic UDF, and finally by using a SQL query.
