Requirement
In this post, we will learn how to handle NULL values in a Spark DataFrame. There are multiple ways to handle NULLs during data processing; we will walk through each of them with examples.
Solution
Create a DataFrame with dummy data
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.createDataFrame(Seq(
  (1100, "Person1", "Location1", null),
  (1200, "Person2", "Location2", "Contact2"),
  (1300, "Person3", "Location3", null),
  (1400, "Person4", null, "Contact4"),
  (1500, "Person5", "Location4", null)
)).toDF("id", "name", "location", "contact")
Find Rows having NULL
df.filter($"location".isNull || $"contact".isNull).show
df.where("location is null or contact is null").show
Remove Rows having NULL
By specifying column names
df.filter(col("location").isNotNull && col("contact").isNotNull).show
df.where("location is not null and contact is not null").show
Without specifying column names
df.na.drop().show
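With no arguments, na.drop() removes a row if any of its columns is NULL. The method also accepts a drop mode and a column subset; as a sketch against the same DataFrame:

```scala
// Drop a row only when *every* column is NULL
df.na.drop("all").show

// Consider only the "location" column when deciding which rows to drop
df.na.drop(Seq("location")).show
```

Restricting the column subset is useful when NULLs in optional columns (like contact) should not cause a row to be discarded.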
Replace NULL with any constant value
df.withColumn("location", when($"location".isNull, "Dummy Location").otherwise($"location")).show
Wrapping Up
In this post, we have learned how to handle NULL values in a Spark DataFrame. We can filter rows containing NULLs, drop them entirely, or replace the NULLs with a dummy constant value.