Load JSON Data in Hive non-partitioned table using Spark

Requirement

Suppose the source data is in JSON format. The requirement is to load this JSON data into a Hive non-partitioned table using Spark.

Let’s break the requirement into two tasks:

  • Load the JSON data into a Spark data frame and read it
  • Store it in a Hive non-partitioned table

Components Involved

To achieve the requirement, the following components will be used:

  • Hive – Stores the data in a non-partitioned table with the ORC file format.
  • Spark SQL – Loads the JSON data, processes it, and stores it in Hive.

Here the Hive table will be non-partitioned and will store the data in ORC format.

Solution

Step 1: JSON sample data

A sample of the JSON-formatted data:

  1. [{"id":1,"first_name":"Steven","last_name":"Garrett","email":"sgarrett0@amazon.co.uk","gender":"Male","designation":"Design Engineer","phone":"62-(454)694-4544","country":"Indonesia"},{"id":2,"first_name":"Brandon","last_name":"Green","email":"bgreen1@fc2.com","gender":"Male","designation":"Graphic Designer","phone":"7-(587)965-1714","country":"Russia"},{"id":3,"first_name":"Martha","last_name":"Bennett","email":"mbennett2@abc.net.au","gender":"Female","designation":"Recruiting Manager","phone":"48-(582)234-3809","country":"Poland"},{"id":4,"first_name":"Samuel","last_name":"Lopez","email":"slopez3@usgs.gov","gender":"Male","designation":"Automation Specialist I","phone":"33-(654)376-1795","country":"France"},{"id":5,"first_name":"Judy","last_name":"Bishop","email":"jbishop4@163.com","gender":"Female","designation":"Librarian","phone":"81-(509)760-1241","country":"Japan"},{"id":6,"first_name":"Anna","last_name":"Morales","email":"amorales5@eventbrite.com","gender":"Female","designation":"Assistant Professor","phone":"33-(675)922-1030","country":"France"},{"id":7,"first_name":"Benjamin","last_name":"Walker","email":"bwalker6@furl.net","gender":"Male","designation":"Computer Systems Analyst II","phone":"86-(249)310-6467","country":"China"},{"id":8,"first_name":"Sean","last_name":"Perkins","email":"sperkins7@usatoday.com","gender":"Male","designation":"Nurse","phone":"1-(504)398-8997","country":"Canada"}]

Download the sample data from the link below:

sample json data

Once you download the file, keep it at a local path. In my case, the file path is “/home/bdp/My_Work_Book/Spark/jsondata.json”.
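
Before touching Spark, it is worth sanity-checking that the downloaded file is valid JSON. Below is a minimal sketch using Python’s standard json module, assuming the same local path as above:

import json

# Parse the downloaded file and report how many records it contains.
with open("/home/bdp/My_Work_Book/Spark/jsondata.json") as f:
    records = json.load(f)
print("Loaded %d records" % len(records))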

Step 2: Verify hive database

First, open the Hive CLI and verify the database. The Hive table will be created under this database. If the database is available, select it using the “use” command.

hive> use bdp_db;
OK
Time taken: 0.024 seconds
hive> show tables;
OK
Time taken: 0.274 seconds

Here, I have used Hive’s “USE” command directly to select the database. If the database does not exist, this command will give an error; in that case, create it with CREATE DATABASE <database_name>. The “SHOW TABLES” command lists the existing tables under the selected database.
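
If you prefer not to switch between shells, the same setup can also be done from the pyspark shell started in step 3. A minimal sketch, assuming the HiveContext is available as sqlContext (as it is in the shell below):

# Create the database if it does not exist, select it, and list its tables.
sqlContext.sql("CREATE DATABASE IF NOT EXISTS bdp_db")
sqlContext.sql("USE bdp_db")
sqlContext.sql("SHOW TABLES").show()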

Step 3: Execution on spark-shell

In this step, we will import the JSON data into Hive using Spark SQL. First, start the Spark command line. Here I am using the pyspark command to start.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
Using Python version 2.7.12 (default, Jun 29 2016 11:08:50)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

Now the pyspark shell is running. Notice in the last line that it has already created a SparkContext as sc and a HiveContext as sqlContext.

 
 
>>> import json
>>> from pyspark.sql import HiveContext
>>> hiveContext = HiveContext(sc)
>>> jsonDF = hiveContext.read.json('file:///home/bdp/My_Work_Book/Spark/jsondata.json')

Here, I have created a HiveContext from the SparkContext. In the last line, we load the JSON file; the read.json call parses the file itself, so the standalone json import is not strictly required. The JSON file path is the local path where the file exists. This call returns a data frame.
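
Note that this shell runs Spark 1.6, where each line of the input file must be a self-contained JSON document. In Spark 2.x and later, the entry point is a SparkSession, and a file containing one large JSON array is typically read with the multiLine option (available since Spark 2.2). A rough equivalent, assuming the same file path:

from pyspark.sql import SparkSession

# SparkSession replaces SQLContext/HiveContext as the entry point in Spark 2.x+.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# multiLine lets Spark parse a single JSON array that spans the whole file.
jsonDF = spark.read.option("multiLine", "true") \
    .json("file:///home/bdp/My_Work_Book/Spark/jsondata.json")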

To check the schema of the data frame:

>>> jsonDF.printSchema()
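
The printed schema should look roughly like the following. The JSON reader sorts the fields alphabetically and infers id as a long:

root
 |-- country: string (nullable = true)
 |-- designation: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone: string (nullable = true)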

There is a function called “show” that displays the data frame records. It takes an optional argument that limits the number of records shown.

>>> jsonDF.show(3)

Now, we are done with the first task. The second task is to save the data frame in a Hive table in ORC format.

>>> jsonDF.write.format("orc").saveAsTable("bdp_db.jsonTest")
INFO HiveContext$$anon$2: Persisting data source relation `bdp_db`.`jsonTest` with a single input path into Hive metastore in Hive compatible format. Input path: hdfs://user/bdp/db/jsontest.

In the above command, format specifies the storage format, and saveAsTable saves the data frame as a Hive table. That’s it. It stores the data frame’s data in the Hive database bdp_db under the table name “jsonTest”.
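
By default, saveAsTable fails if the table already exists. The write mode can be controlled through the standard DataFrameWriter API; a small sketch (“overwrite” and “append” are the usual choices):

# Overwrite the table if it already exists; use "append" to add rows instead.
jsonDF.write \
    .mode("overwrite") \
    .format("orc") \
    .saveAsTable("bdp_db.jsonTest")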

Step 4: Verify data in Hive

Once done with step 3, let’s verify the Hive table in the database bdp_db. I am already under the bdp_db database, so I check the table directly.

hive> show tables;
OK
jsontest
Time taken: 0.111 seconds, Fetched: 1 row(s)

To describe the table, use the query below:

 
 
hive> desc formatted jsontest;
OK
# col_name data_type comment
country string
designation string
email string
first_name string
gender string
id bigint
last_name string
phone string
# Detailed Table Information
Database: bdp_db
Owner: bdp
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://user/bdp/db/jsontest
Table Type: MANAGED_TABLE
Table Parameters:
 COLUMN_STATS_ACCURATE false
 EXTERNAL FALSE
 numFiles 1
 numRows -1
 rawDataSize -1
 spark.sql.sources.provider orc
 spark.sql.sources.schema.numParts 1
 spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"country\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"designation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"email\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"first_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"gender\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"phone\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
 totalSize 28195
 transient_lastDdlTime 1489597485
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
 path hdfs://user/xdishaw/db/jsontest
 serialization.format 1
Time taken: 0.146 seconds, Fetched: 43 row(s)

Check the data in the Hive table:

-- check data
hive> select * from jsontest limit 3;
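
The same check can also be run from the pyspark shell without leaving it; a minimal sketch using the HiveContext created earlier:

# Query the freshly created table through the HiveContext and show a few rows.
sqlContext.sql("SELECT * FROM bdp_db.jsontest LIMIT 3").show()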

Wrapping Up

Here we loaded JSON data directly into a Spark data frame. Once the data is available in the data frame, we can process it with transformations and actions. The saveAsTable function saves the content of the data frame as a Hive table.
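
For reference, the whole flow condenses into one small pyspark script. This is a sketch assuming the same file path, database, and table names used above:

from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create the contexts (the pyspark shell does this automatically as sc / sqlContext).
sc = SparkContext(appName="json-to-hive")
hiveContext = HiveContext(sc)

# Load the JSON file from the local filesystem into a data frame.
jsonDF = hiveContext.read.json("file:///home/bdp/My_Work_Book/Spark/jsondata.json")

# Ensure the target database exists, then save the data frame as an ORC table.
hiveContext.sql("CREATE DATABASE IF NOT EXISTS bdp_db")
jsonDF.write.format("orc").saveAsTable("bdp_db.jsonTest")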

