Load a Spark DataFrame into a non-existing Hive table

Requirement:

You have a DataFrame that you want to save into a Hive table for future use, but you do not want to create the Hive table first. Instead, you want to save the DataFrame directly to Hive and let Spark create the table.

Given:

Sample data:

 
101, "alex",88.56
102, "john",68.32
103, "peter",75.62
104, "jeff",92.67
105, "mathew",89.56
106, "alan",72.57
107, "steve",96.12
108, "mark",98.45
109, "adam",76.25
109, "david",78.45

Solution:

Note: Skip Step 1 if you already have a Spark DataFrame.

Step 1: Creation of Spark DataFrame

Go to spark-shell.

Note: I am using Spark version 2.3.

Use the code below to create the Spark DataFrame. If you need an explanation of the code, please refer to THIS post.

 import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build an RDD of Rows from the sample data
val stu_rdd = spark.sparkContext.parallelize(Seq(
  Row(101, "alex", 88.56),
  Row(102, "john", 68.32),
  Row(103, "peter", 75.62),
  Row(104, "jeff", 92.67),
  Row(105, "mathew", 89.56),
  Row(106, "alan", 72.57),
  Row(107, "steve", 96.12),
  Row(108, "mark", 98.45),
  Row(109, "adam", 76.25),
  Row(109, "david", 78.45)
))

// Column names paired with their type names
val schema_list = List(("id", "int"), ("name", "string"), ("percentage", "double"))
// Fold the (name, type) pairs into a StructType, one field at a time
val schema = schema_list.foldLeft(new StructType())((s, x) => s.add(x._1, x._2))

// Apply the schema to the RDD to get the DataFrame
val students = spark.createDataFrame(stu_rdd, schema)
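
For small in-memory data like this, a shorter route to the same columns is to call toDF on a sequence of tuples. The following is only a sketch, assuming it runs in spark-shell where `spark` is already defined. The name `studentsAlt` is hypothetical and used here so that it does not clash with `students`.

```scala
// Assumes spark-shell, where `spark` (a SparkSession) is predefined.
import spark.implicits._

// Same sample data as above, as tuples instead of Row objects.
// `studentsAlt` is a hypothetical name chosen for this sketch.
val studentsAlt = Seq(
  (101, "alex", 88.56),
  (102, "john", 68.32),
  (103, "peter", 75.62),
  (104, "jeff", 92.67),
  (105, "mathew", 89.56),
  (106, "alan", 72.57),
  (107, "steve", 96.12),
  (108, "mark", 98.45),
  (109, "adam", 76.25),
  (109, "david", 78.45)
).toDF("id", "name", "percentage")

// Inspect the inferred schema and the rows before saving anything.
studentsAlt.printSchema()
studentsAlt.show()
```

Here the column types are inferred from the tuple element types, so there is no separate schema to build. The explicit StructType in Step 1 is still the better fit when the schema arrives as data, for example as a list of column names and type names.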

Step 2: Saving into Hive

Now that you have the DataFrame “students”, let’s say the table we want to create is “bdp.students_tbl”, where bdp is the name of the database.

Use the code below to save it into Hive.

 students.write.saveAsTable("bdp.students_tbl")
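
The one-liner above relies on defaults: it expects the database bdp to exist already, and it raises an error if the table is already there. Below is a minimal sketch of a more explicit variant. It assumes a spark-shell session with Hive support available and the `students` DataFrame from Step 1. The choice of overwrite mode and Parquet format is an assumption for illustration, not something the steps above require.

```scala
// Assumes spark-shell with Hive support and the `students` DataFrame from Step 1.
import org.apache.spark.sql.SaveMode

// saveAsTable does not create the database, so create it if it is missing.
spark.sql("CREATE DATABASE IF NOT EXISTS bdp")

students.write
  .mode(SaveMode.Overwrite)        // replace the table if it already exists
  .format("parquet")               // the usual default format, stated explicitly
  .saveAsTable("bdp.students_tbl") // Spark creates the table from the DataFrame schema
```

Use SaveMode.Append instead of SaveMode.Overwrite if later runs should add rows to the table rather than replace it.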

Step 3: Output

Go to the Hive CLI and use the query below to check the Hive table.

 select * from bdp.students_tbl;
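
The same check can also be made without leaving spark-shell. This is a sketch assuming the table was saved as bdp.students_tbl in Step 2. The name `saved` is hypothetical.

```scala
// Assumes spark-shell and that Step 2 has already created bdp.students_tbl.

// List the tables Spark can see in the bdp database.
spark.sql("SHOW TABLES IN bdp").show()

// Read the saved table back as a DataFrame; `saved` is a hypothetical name.
val saved = spark.table("bdp.students_tbl")
saved.printSchema()
saved.show()
```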

Wrapping up:

This approach is useful when the result of a DataFrame needs to be used by other applications.
