Create a Spark DataFrame from sample data

Requirement:

You have sample data for some students and want to create a DataFrame from it so that you can run operations on the data.

Given:

Sample data:

 101, "alex",88.56
102, "john",68.32
103, "peter",75.62
104, "jeff",92.67
105, "mathew",89.56
106, "alan",72.57
107, "steve",96.12
108, "mark",98.45
109, "adam",76.25
109, "david",78.45

Solution:

Step 1: Open the Spark shell

Use the command below:

 spark-shell

Note: I am using Spark 2.3.

Then make the necessary imports with the code below:

 import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

 

Step 2: Create the RDD

Let's create an RDD that holds one Row for each sample record.

Use the parallelize method of the SparkContext to turn a local collection into an RDD.

Use the code below:

 val stu_rdd = spark.sparkContext.parallelize(Seq(
   Row(101, "alex", 88.56),
   Row(102, "john", 68.32),
   Row(103, "peter", 75.62),
   Row(104, "jeff", 92.67),
   Row(105, "mathew", 89.56),
   Row(106, "alan", 72.57),
   Row(107, "steve", 96.12),
   Row(108, "mark", 98.45),
   Row(109, "adam", 76.25),
   Row(109, "david", 78.45)
 ))
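Before moving on, it can help to confirm that the RDD holds what you expect. The following is a minimal sketch rather than part of the original steps: it builds a two-row RDD the same way and checks the row count and the first row. The names session, sampleRdd, rowCount, firstRow, and firstName are made up for this illustration. In spark-shell, getOrCreate() returns the session that already exists.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical name `session`; in spark-shell this is the same session as `spark`.
val session = SparkSession.builder().master("local[*]").appName("rdd-check").getOrCreate()

// Two of the sample rows are enough to demonstrate the checks.
val sampleRdd = session.sparkContext.parallelize(Seq(
  Row(101, "alex", 88.56),
  Row(102, "john", 68.32)
))

val rowCount = sampleRdd.count()       // number of Row objects in the RDD
val firstRow = sampleRdd.first()       // first Row; fields are read by position
val firstName = firstRow.getString(1)  // position 1 holds the name
println(s"rows=$rowCount, first name=$firstName")
```

Note that a Row carries no column names at this point, which is why the fields are read by position.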

 

Step 3: Creation of Schema

Now you have an RDD, but it has no schema. Each row has three columns, so let's name them id, name, and percentage.

For convenience, we define the schema as a list of (column name, type) pairs:

 

 val schema_list = List(("id", "int"), ("name", "string"), ("percentage", "double"))

Next, create an empty schema of type StructType. Use the command below:

 var schema=new StructType()

Now iterate over the list and add each item to the schema. A StructType is immutable, so add returns a new StructType, which is assigned back to schema:

 schema_list.foreach(x => schema = schema.add(x._1, x._2))

After running the code above, schema contains the fields listed in schema_list.
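The same schema can also be built without a mutable variable. The sketch below is an alternative to the loop above, not the method used in this tutorial: it folds over the list, starting from an empty StructType and adding one field per pair. The names columnSpecs and foldedSchema are made up for this illustration.

```scala
import org.apache.spark.sql.types.StructType

// (column name, type name) pairs, same content as schema_list.
val columnSpecs = List(("id", "int"), ("name", "string"), ("percentage", "double"))

// Start from an empty StructType; each add returns a new StructType with one more field.
val foldedSchema = columnSpecs.foldLeft(new StructType()) {
  case (acc, (colName, typeName)) => acc.add(colName, typeName)
}

// Print the fields as a tree to check names and types.
foldedSchema.printTreeString()
```

Either form gives createDataFrame the same three fields. The fold simply avoids reassigning a var.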

Step 4: Create the DataFrame

To create the DataFrame, pass the RDD and the schema to createDataFrame as below:

 var students = spark.createDataFrame(stu_rdd,schema)

The students DataFrame has now been created, and you can use it to perform operations.

Use the command below to see the contents of the DataFrame:
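As an illustration of such an operation, here is a minimal sketch that filters students scoring above 90 percent. It is self-contained and uses three of the sample rows. The names session, sampleSchema, sampleRows, sampleDf, and topNames are made up for this example, and the threshold of 90 is arbitrary.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Hypothetical name `session`; in spark-shell this is the same session as `spark`.
val session = SparkSession.builder().master("local[*]").appName("students-check").getOrCreate()

// Explicit schema with the same three columns used in this tutorial.
val sampleSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("percentage", DoubleType)
))

val sampleRows = session.sparkContext.parallelize(Seq(
  Row(106, "alan", 72.57),
  Row(107, "steve", 96.12),
  Row(108, "mark", 98.45)
))
val sampleDf = session.createDataFrame(sampleRows, sampleSchema)

// Keep rows above 90 percent, sort by id, and collect the names to the driver.
val topNames = sampleDf
  .filter("percentage > 90")
  .orderBy("id")
  .select("name")
  .collect()
  .map(_.getString(0))
  .toList
println(topNames)
```

The same filter, orderBy, and select calls work unchanged on the students DataFrame.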

 students.show()
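For small samples there is also a shorter route that skips the Row RDD and the explicit schema. The sketch below is a different technique from the steps above: it calls toDF on a local Seq of tuples, which is available once the session's implicits are imported, and Spark infers the column types from the tuple. The names session and quickDf are made up for this illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical name `session`; in spark-shell this is the same session as `spark`.
val session = SparkSession.builder().master("local[*]").appName("todf-sketch").getOrCreate()
import session.implicits._  // brings toDF into scope for local collections

// Column types (int, string, double) are inferred from the tuple element types.
val quickDf = Seq(
  (101, "alex", 88.56),
  (102, "john", 68.32)
).toDF("id", "name", "percentage")

quickDf.printSchema()
quickDf.show()
```

The explicit-schema route from the steps above remains useful when you want to control the types yourself or build the schema from a list at runtime.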

Wrapping up:

While coding, we often need a DataFrame of sample data to understand the business requirement and to get a better understanding of the data. The approach above is handy in those situations.
