Requirement
Assume you have a Hive table named reports and need to process its data in Spark. Once the table's data is in a Spark data frame, you can transform it further as the business requires. So let’s load the Hive table into a Spark data frame.
Solution
Follow the steps below:
Step 1: Sample table in Hive
Let’s create a table named “reports” in Hive. I am creating it in a schema named bdp.
Open the Hive CLI and run the commands below to create the schema and the table and to insert sample rows:
create schema bdp;
create table bdp.reports(id int, days int, year int);
INSERT INTO TABLE bdp.reports VALUES (121,232,2015),(122,245,2015),(123,134,2014),(126,67,2016),(182,122,2016),(137,92,2015),(101,311,2015);
Please refer to the screenshot below.
Step 2: Check table data
Run the command below to see the records you have inserted.
select * from bdp.reports;
Refer to the screenshot below.
Step 3: Data Frame Creation
Start spark-shell using the command below:
spark-shell
Check whether a SQL context with Hive support is available.
In the screenshot below, the last lines of the startup output read “Created SQL context (with Hive support). SQL context available as sqlContext.” This means that you can use the sqlContext object to interact with Hive.
Now, create a data frame named hiveReports using the command below:
val hiveReports = sqlContext.sql("select * from bdp.reports")
Pass your Hive query to sqlContext.sql as a string. Whatever data the query returns will be available in the data frame.
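The steps above rely on the sqlContext object that older (1.x) spark-shell versions create. As a hedged sketch for Spark 2.0 and later, where SparkSession is the entry point, the same table could be loaded as shown below. The application name "LoadHiveTable" and the variable name sameReports are only illustrative, and the sketch assumes that Spark is built with Hive support and can reach the metastore that holds bdp.reports.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.0+ with Hive support. In spark-shell a session named
// `spark` usually exists already, and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .appName("LoadHiveTable") // illustrative name, not from the original post
  .enableHiveSupport()
  .getOrCreate()

// Same query as in Step 3, run through the session instead of sqlContext.
val hiveReports = spark.sql("select * from bdp.reports")

// Alternative: load the whole table by name without writing a query.
val sameReports = spark.table("bdp.reports")
```

Both calls return a data frame with the columns id, days and year, so the rest of the steps work unchanged.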
Step 4: Output
Check whether the reports table has been loaded into the data frame hiveReports using the commands below.
Check the schema of the data frame:
hiveReports.printSchema()
Show the data of the data frame:
hiveReports.show()
It shows the same records that we got in Step 2.
Please refer to the screenshot below.
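Besides looking at the printed output, the load can also be checked programmatically. The snippet below is a small sketch that assumes hiveReports is the data frame created in Step 3; the names rowCount and columnNames are only illustrative.

```scala
// hiveReports is the data frame created in Step 3.
val rowCount = hiveReports.count()            // number of rows read from bdp.reports
val columnNames = hiveReports.columns.toList  // column names in table order

println(s"rows: $rowCount, columns: ${columnNames.mkString(", ")}")
```

Since Step 1 inserted seven rows into a table with the columns id, days and year, the count should be 7 and the column list should match those three names.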
You can now use this data frame to join with another dataset, filter rows, or perform other transformations as needed.
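As an illustration of such follow-up work, here is a minimal sketch of a filter and an aggregation on the sample data. It assumes hiveReports is the data frame from Step 3; the names reports2015, daysPerYear and total_days are made up for this example.

```scala
import org.apache.spark.sql.functions.{col, sum}

// hiveReports is the data frame created in Step 3.
// Keep only the rows for the year 2015.
val reports2015 = hiveReports.filter(col("year") === 2015)
reports2015.show()

// Total number of days per year, sorted by year.
val daysPerYear = hiveReports
  .groupBy("year")
  .agg(sum("days").as("total_days"))
  .orderBy("year")
daysPerYear.show()
```

With the seven sample rows from Step 1, the filter keeps four rows, and the totals come to 134 for 2014, 880 for 2015 and 189 for 2016.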
Keep learning.