Requirement
In this post, we will learn how to handle NULL values in a Spark DataFrame. There are multiple ways to handle NULLs during data processing; we will walk through each of them with examples.
Solution
Create a DataFrame with dummy data
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.createDataFrame(Seq(
  (1100, "Person1", "Location1", null),
  (1200, "Person2", "Location2", "Contact2"),
  (1300, "Person3", "Location3", null),
  (1400, "Person4", null, "Contact4"),
  (1500, "Person5", "Location4", null)
)).toDF("id", "name", "location", "contact")
Find Rows having NULL
df.filter($"location".isNull || $"contact".isNull).show
df.where("location is null or contact is null").show
Remove Rows having NULL
By specifying column names
df.filter(col("location").isNotNull && col("contact").isNotNull).show
df.where("location is not null and contact is not null").show
Without specifying column names
df.na.drop().show
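With no arguments, na.drop() removes a row if any of its columns is NULL. The method also accepts a drop mode and a column subset; as a sketch against the same DataFrame:

```scala
// Drop a row only when *every* column is NULL
df.na.drop("all").show

// Consider only the "location" column when deciding which rows to drop
df.na.drop(Seq("location")).show
```

Restricting the column subset is useful when NULLs in optional columns (like contact) should not cause a row to be discarded.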
Replace NULL with any constant value
df.withColumn("location", when($"location".isNull, "Dummy Location").otherwise($"location")).show
Wrapping Up
In this post, we have learned how to handle NULL values in a Spark DataFrame. We can filter rows containing NULLs, drop them entirely, or replace the NULLs with a dummy constant value.