NULLs in Spark DataFrame

Requirement

In this post, we will learn how to handle NULLs in a Spark DataFrame. There are multiple ways to handle NULLs during data processing; we will see how to do it with the Spark DataFrame API.

Solution

Create a DataFrame with dummy data

// Needed for the $"..." column syntax and for col()/when() used below
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.createDataFrame(Seq(
  (1100, "Person1", "Location1", null),
  (1200, "Person2", "Location2", "Contact2"),
  (1300, "Person3", "Location3", null),
  (1400, "Person4", null, "Contact4"),
  (1500, "Person5", "Location4", null)
)).toDF("id", "name", "location", "contact")
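To see which columns can actually hold NULLs, you can inspect the inferred schema; for tuple-based data like this, Spark marks the String columns as nullable:

```scala
// Print the inferred schema; the String columns (name, location, contact)
// are nullable, while the primitive Int column (id) is not
df.printSchema()
```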

 

Rows having NULL

df.filter($"location".isNull || $"contact".isNull).show

df.where("location is null or contact is null").show

Remove Rows having NULL

By mentioning column name

df.filter(col("location").isNotNull && col("contact").isNotNull).show
df.where("location is not null and contact is not null").show

Without mentioning column names

df.na.drop().show
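With no arguments, na.drop removes every row that has a NULL in any column. The same DataFrameNaFunctions API also lets you control how aggressively rows are dropped, sketched below:

```scala
// Drop rows only when ALL of their columns are null
df.na.drop("all").show

// Drop rows with a null in the listed columns only
df.na.drop(Seq("location", "contact")).show

// Drop rows having fewer than 3 non-null values
df.na.drop(3).show
```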

Replace NULL with a constant value

df.withColumn("location", when($"location".isNull, "Dummy Location").otherwise($"location")).show
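The same replacement can be done more concisely with na.fill, which also supports per-column defaults (the replacement values here are arbitrary examples):

```scala
// Replace nulls in all String columns with one constant
df.na.fill("Unknown").show

// Or supply a default per column via a Map
df.na.fill(Map(
  "location" -> "Dummy Location",
  "contact"  -> "No Contact"
)).show
```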

Wrapping Up

In this post, we have learned how to handle NULLs in a Spark DataFrame. We can either filter out the rows containing NULLs or replace the NULLs with a dummy value.
