Create a Spark DataFrame from sample data


You have sample data for some students and want to create a DataFrame from it so that you can run operations on it.


Sample data:

  101, "alex", 88.56
  102, "john", 68.32
  103, "peter", 75.62
  104, "jeff", 92.67
  105, "mathew", 89.56
  106, "alan", 72.57
  107, "steve", 96.12
  108, "mark", 98.45
  109, "adam", 76.25
  109, "david", 78.45


Step 1: Go to the Spark shell

Use the command below:

  spark-shell

Note: I am using Spark version 2.3.

Then make the necessary imports using the code below:

  import spark.implicits._
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._


Step 2: Creation of RDD

Let's create an RDD that holds one Row for each sample record.

The RDD is created with parallelize, which is a method of the SparkContext (not a keyword) that distributes a local collection across the cluster.

Use the code below:

  val stu_rdd = spark.sparkContext.parallelize(Seq(
    Row(101, "alex", 88.56),
    Row(102, "john", 68.32),
    Row(103, "peter", 75.62),
    Row(104, "jeff", 92.67),
    Row(105, "mathew", 89.56),
    Row(106, "alan", 72.57),
    Row(107, "steve", 96.12),
    Row(108, "mark", 98.45),
    Row(109, "adam", 76.25),
    Row(109, "david", 78.45)
  ))
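
Before moving on, it can help to confirm that the RDD holds what you expect. The following is a minimal sketch, assuming it runs inside spark-shell (where `spark` is already defined). The name `sample_rdd` is hypothetical and uses only the first three sample rows to keep the example short.

```scala
import org.apache.spark.sql.Row

// Hypothetical shortened RDD with the same row shape as stu_rdd.
val sample_rdd = spark.sparkContext.parallelize(Seq(
  Row(101, "alex", 88.56),
  Row(102, "john", 68.32),
  Row(103, "peter", 75.62)
))

// count() is an action: it runs the job and returns the number of rows.
val n = sample_rdd.count()

// Each element is a Row; fields are read by position with typed getters.
val first = sample_rdd.first()
val firstId = first.getInt(0)
val firstName = first.getString(1)
val firstPct = first.getDouble(2)
```

The typed getters fail at runtime if the position holds a different type, which is one reason the next step attaches an explicit schema.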


Step 3: Creation of Schema

You now have an RDD, but it has no schema. Each row has three columns; we will name them id, name, and percentage.

For convenience, we define the schema as a list of (column name, type name) pairs:


  val schema_list = List(("id", "int"), ("name", "string"), ("percentage", "double"))

Next, create an empty schema of type StructType. It is declared with var because it is reassigned when the columns are added:

  var schema = new StructType()

Now iterate over the list and add each item to the schema. StructType is immutable, so add returns a new schema, which is assigned back to schema:

  schema_list.foreach(x => schema = schema.add(x._1, x._2))

After executing the above code, schema contains the columns defined in the list schema_list.
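
As a cross-check on the schema-building step, here is a sketch of two alternative ways to reach the same schema. It swaps the var-and-foreach loop for a `foldLeft`, and also writes the schema out with explicit `StructField` entries. The name `explicit_schema` is hypothetical; the nullable flags are set to true because that is the default used by `add(name, typeName)`.

```scala
import org.apache.spark.sql.types._

// Same (name, type) pairs as schema_list in the article.
val schema_list = List(("id", "int"), ("name", "string"), ("percentage", "double"))

// foldLeft threads the growing schema through the list, so no var is needed.
// add(name, typeName) parses the type name string into a Spark DataType.
val schema = schema_list.foldLeft(new StructType()) {
  case (acc, (name, typeName)) => acc.add(name, typeName)
}

// Hypothetical equivalent written with explicit field definitions.
val explicit_schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("percentage", DoubleType, nullable = true)
))

// Print the schema as a tree to confirm column names and types.
schema.printTreeString()
```

The list-of-strings form is convenient when column definitions come from configuration, while the explicit form lets the compiler catch a mistyped data type.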

Step 4: Creation of DataFrame

To create the DataFrame, pass the RDD and the schema to createDataFrame:

  val students = spark.createDataFrame(stu_rdd, schema)

The students DataFrame has now been created, and you can use it to perform operations.

Use the command below to see the content of the DataFrame:

  students.show()
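
To illustrate the kind of operations the DataFrame supports, here is a minimal sketch that rebuilds a three-row version of the data and then inspects and filters it. It assumes spark-shell (where `spark` is predefined). The names `rows`, `students_sample`, and `top` are hypothetical, as is the threshold of 70.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical three-row sample with the same columns as the article's data.
val rows = spark.sparkContext.parallelize(Seq(
  Row(101, "alex", 88.56),
  Row(102, "john", 68.32),
  Row(103, "peter", 75.62)
))
val schema = new StructType()
  .add("id", "int")
  .add("name", "string")
  .add("percentage", "double")
val students_sample = spark.createDataFrame(rows, schema)

// Inspect column names and types, then the rows themselves.
students_sample.printSchema()
students_sample.show()

// Example operation: keep students above 70 percent, highest score first.
val top = students_sample
  .filter("percentage > 70")
  .orderBy(col("percentage").desc)
top.show()
```

Note that `show()` prints only the first 20 rows by default, which is enough for sample data of this size.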


Wrapping up:

While coding, we often need a DataFrame built from sample data to understand the business requirement and to get a better feel for the data. In those situations, the approach above comes in handy.

