Requirement:
You have sample data for a few students and want to create a DataFrame from it so that you can run operations on it.
Given:
Sample data:
101, "alex", 88.56
102, "john", 68.32
103, "peter", 75.62
104, "jeff", 92.67
105, "mathew", 89.56
106, "alan", 72.57
107, "steve", 96.12
108, "mark", 98.45
109, "adam", 76.25
109, "david", 78.45
Solution:
Step 1: Open the Spark shell
Use the command below:
spark-shell
Note: I am using Spark 2.3.
Then make the necessary imports, one per line:
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
Step 2: Creation of RDD
Let's create an RDD that holds one Row per sample record.
To create the RDD, call the parallelize method on the SparkContext (it is a method, not a keyword).
Use the code below:
val stu_rdd = spark.sparkContext.parallelize(Seq(
  Row(101, "alex", 88.56),
  Row(102, "john", 68.32),
  Row(103, "peter", 75.62),
  Row(104, "jeff", 92.67),
  Row(105, "mathew", 89.56),
  Row(106, "alan", 72.57),
  Row(107, "steve", 96.12),
  Row(108, "mark", 98.45),
  Row(109, "adam", 76.25),
  Row(109, "david", 78.45)
))
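Before moving on, it can help to confirm that the RDD holds what you expect. The snippet below is a minimal sketch, assuming a local session; the name demoRdd and the two-row subset are only for illustration. In spark-shell the spark value already exists, so getOrCreate() simply returns the running session.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("rdd-check").getOrCreate()

// Hypothetical two-row RDD built the same way as stu_rdd.
val demoRdd = spark.sparkContext.parallelize(Seq(Row(101, "alex", 88.56), Row(102, "john", 68.32)))

// count() returns the number of Row objects in the RDD.
println(demoRdd.count())

// Fields of a Row are read by position with typed getters.
val firstRow = demoRdd.first()
println(firstRow.getInt(0) + " " + firstRow.getString(1) + " " + firstRow.getDouble(2))
```

The typed getters fail at runtime if the position holds a different type, which is one reason a schema is attached in the next step.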
Step 3: Creation of Schema
You now have an RDD, but it has no schema. Each row has three columns, which we will name id, name, and percentage.
For convenience, we define the schema as a list of (column name, type) pairs:
var schema_list=List(("id","int"),("name","string"),("percentage","double"))
Next, create an empty StructType to hold the schema. Use the command below:
var schema=new StructType()
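Now let's iterate over the list and add each item to the schema. Because add returns a new StructType rather than modifying the existing one, the result is assigned back to schema (foreach is used since only the side effect matters):
schema_list.foreach(x => schema = schema.add(x._1, x._2))
After executing the above code, schema contains the three columns defined in schema_list.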
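The same schema can also be written without a mutable variable. The following is a sketch only; the names explicitSchema, schemaList, and foldedSchema are hypothetical and not part of the steps above. It shows the schema spelled out with StructField, and the list-driven version built with foldLeft.

```scala
import org.apache.spark.sql.types._

// The three columns spelled out with typed StructFields.
val explicitSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("percentage", DoubleType, nullable = true)
))

// Same (name, type) pairs as schema_list, folded into a StructType without a var.
val schemaList = List(("id", "int"), ("name", "string"), ("percentage", "double"))
val foldedSchema = schemaList.foldLeft(new StructType()) {
  case (acc, (name, dataType)) => acc.add(name, dataType)
}

// Print both schemas as a tree to compare them.
explicitSchema.printTreeString()
foldedSchema.printTreeString()
```

The list-driven form is handy when column definitions come from configuration, while the explicit form makes types and nullability visible at a glance.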
Step 4: Creation of DataFrame
To create the DataFrame, pass the RDD and the schema to createDataFrame as below:
var students = spark.createDataFrame(stu_rdd,schema)
The shell output confirms that the students DataFrame has been created. You can now use this DataFrame to perform operations.
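As an aside, for small samples there is a shorter route that skips the Row objects and the explicit schema. This is a sketch of an alternative, not the approach used in this post: with spark.implicits._ in scope, a Seq of tuples can be turned into a DataFrame with toDF. The name viaToDF and the three-row subset are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("todf-sketch").getOrCreate()
import spark.implicits._ // brings toDF into scope for a Seq of tuples

// Column types are inferred from the tuple element types (Int, String, Double).
val viaToDF = Seq((101, "alex", 88.56), (102, "john", 68.32), (103, "peter", 75.62)).toDF("id", "name", "percentage")

viaToDF.printSchema()
```

The trade-off is control: types and nullability are inferred from the Scala types, whereas the Row plus StructType approach lets you state them explicitly.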
Use the command below to see the contents of the DataFrame:
students.show()
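To give an idea of the operations you might run next, here is a minimal sketch on a three-row stand-in built the same way as students. The names rows, sampleSchema, sample, and top, and the threshold of 80, are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{avg, col}
import org.apache.spark.sql.types._

// Assumption: a local session; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("students-ops").getOrCreate()

// Three-row stand-in for the students DataFrame (subset of the sample data).
val rows = spark.sparkContext.parallelize(Seq(Row(101, "alex", 88.56), Row(102, "john", 68.32), Row(104, "jeff", 92.67)))
val sampleSchema = new StructType().add("id", "int").add("name", "string").add("percentage", "double")
val sample = spark.createDataFrame(rows, sampleSchema)

// Show column names and types.
sample.printSchema()

// Keep students above 80 percent, highest percentage first.
val top = sample.filter(col("percentage") > 80).orderBy(col("percentage").desc)
top.show()

// Average percentage across all rows of the stand-in.
sample.agg(avg("percentage")).show()
```

The same calls should work unchanged on the full students DataFrame, since it has the same three columns.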
Wrapping up:
While coding, we often need a DataFrame of sample data to work through a business requirement or to get a better understanding of the data. In those cases, building one by hand as shown here is quick and convenient.