Add multiple columns to a Spark DataFrame

Requirement:

You have a Spark DataFrame and need to add multiple columns to it in one go, without calling withColumn over and over, because you do not know in advance how many columns there will be.

Solution:

Step 1: A Spark DataFrame

Open your spark-shell and create a sample DataFrame. You can skip this step if you already have a Spark DataFrame.

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val stu_rdd = spark.sparkContext.parallelize(Seq(
  Row(101, "alex", 88.56),
  Row(102, "john", 68.32),
  Row(103, "peter", 75.62),
  Row(104, "jeff", 92.67),
  Row(105, "mathew", 89.56),
  Row(106, "alan", 72.57),
  Row(107, "steve", 96.12),
  Row(108, "mark", 98.45),
  Row(109, "adam", 76.25),
  Row(109, "david", 78.45)
))

val schema_list = List(("id", "int"), ("name", "string"), ("percentage", "double"))

var schema = new StructType()
schema_list.foreach(x => schema = schema.add(x._1, x._2))

val students = spark.createDataFrame(stu_rdd, schema)

You can refer to this post, where I have written how to create a sample DataFrame: https://bigdataprogrammers.com/create-a-spark-dataframe-from-sample-data/

So we have a dataframe of students, which we are going to use.
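As a side note, for quick experiments the same kind of sample DataFrame can be built more compactly with toDF, which infers the schema from the tuple types. This is a minimal sketch assuming a spark-shell session (where spark and its implicits are available):

```scala
// Alternative: build a sample DataFrame directly from tuples (spark-shell)
import spark.implicits._

val students = Seq(
  (101, "alex", 88.56),
  (102, "john", 68.32),
  (103, "peter", 75.62)
).toDF("id", "name", "percentage")

students.printSchema()  // id: integer, name: string, percentage: double
```

The explicit StructType approach above is still the better fit when the schema itself has to be built dynamically.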

Step 2: A list for the multiple columns

To add multiple columns, we first need a list holding the information for all the columns; this list can be generated dynamically. For convenience, we define cols_Logics as a list of tuples, where the first field is the name of a column and the second field is the logic for that column.

Remember that the first field has a String type and the second field has a Column type, just like the two arguments of withColumn.

import org.apache.spark.sql.functions._

val cols_Logics = List(
  ("a1", lit("etc")),
  ("a2", expr("current_date()")),
  ("a3", col("percentage") / 100)
)

Step 3: foldLeft

Now we have the logic for all the columns we need to add to our Spark DataFrame, and the DataFrame students. We will use the foldLeft function of List on the cols_Logics list.

val students_df_new = cols_Logics.foldLeft(students) { (tempdf, cols) =>
  tempdf.withColumn(cols._1, cols._2)
}

In the code above, tempdf is the accumulator DataFrame that holds the columns added so far at each iteration; once foldLeft has traversed the complete list, we have all the columns added to our initial DataFrame, i.e. students. cols._1 and cols._2 extract the two values of each tuple.

students_df_new will have all the columns present in the list.
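To see what the foldLeft is doing, for this particular three-element list it unrolls to the equivalent chained withColumn calls (a sketch, assuming the same students DataFrame and cols_Logics values as above):

```scala
// What the foldLeft expands to for the three tuples in cols_Logics
import org.apache.spark.sql.functions.{lit, expr, col}

val students_df_new = students
  .withColumn("a1", lit("etc"))
  .withColumn("a2", expr("current_date()"))
  .withColumn("a3", col("percentage") / 100)
```

The difference is that foldLeft writes this chain for you, however long the list happens to be at runtime.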

Wrapping up:

You can use a list that is generated dynamically. This approach is well suited when you have many columns and the logic for them is defined elsewhere, perhaps in a file. Writing withColumn repeatedly is tedious, and it is not feasible when the number of columns is high or dynamic.
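One design note: each withColumn call adds a projection to the query plan, so for very wide DataFrames a single select over the same list is often preferred. A sketch of the equivalent single-projection version, assuming the same students DataFrame and cols_Logics list defined above:

```scala
// Equivalent single projection: keep all existing columns, append the new ones
import org.apache.spark.sql.functions.col

val newCols = cols_Logics.map { case (name, logic) => logic.as(name) }
val students_df_new = students.select(col("*") +: newCols: _*)
```

Both versions produce the same columns; the select version just does it in one pass over the plan.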
