Load JSON data in Hive using PySpark

Requirement

Suppose there is a source data which is in JSON format. The requirement is to load JSON data into Hive non-partitioned table using Spark.

Let’s break the requirement into two tasks:

Load JSON data in spark data frame and read it
Store into hive non-partition table

Components Involved

To achieve the requirement, below components will be used:

Hive – It is used to store data in a non-partitioned table with ORC file format.
Spark SQL – It is used to load the JSON data, process and store into the hive.

Here, the hive table will be a non-partitioned table and will store the data in ORC format.

Solution

Step 1: JSON sample data

The sample of JSON formatted data:

 [{"id":1,"first_name":"Steven","last_name":"Garrett","email":"sgarrett0@amazon.co.uk","gender":"Male","designation":"Design Engineer","phone":"62-(454)694-4544","country":"Indonesia"},{"id":2,"first_name":"Brandon","last_name":"Green","email":"bgreen1@fc2.com","gender":"Male","designation":"Graphic Designer","phone":"7-(587)965-1714","country":"Russia"},{"id":3,"first_name":"Martha","last_name":"Bennett","email":"mbennett2@abc.net.au","gender":"Female","designation":"Recruiting Manager","phone":"48-(582)234-3809","country":"Poland"},{"id":4,"first_name":"Samuel","last_name":"Lopez","email":"slopez3@usgs.gov","gender":"Male","designation":"Automation Specialist I","phone":"33-(654)376-1795","country":"France"},{"id":5,"first_name":"Judy","last_name":"Bishop","email":"jbishop4@163.com","gender":"Female","designation":"Librarian","phone":"81-(509)760-1241","country":"Japan"},{"id":6,"first_name":"Anna","last_name":"Morales","email":"amorales5@eventbrite.com","gender":"Female","designation":"Assistant Professor","phone":"33-(675)922-1030","country":"France"},{"id":7,"first_name":"Benjamin","last_name":"Walker","email":"bwalker6@furl.net","gender":"Male","designation":"Computer Systems Analyst II","phone":"86-(249)310-6467","country":"China"},{"id":8,"first_name":"Sean","last_name":"Perkins","email":"sperkins7@usatoday.com","gender":"Male","designation":"Nurse","phone":"1-(504)398-8997","country":"Canada"}]

Download sample data from the below link:

sample json data

Once you download the file, keep it in the local path. In my case, the file path is “home/bdp/My_Work_Book/Spark/jsondata.json”.

Step 2: Create hive database

First, open the hive CLI and create a database using below command. If the database is available, then select the database using “use” command.

hive> create database bdp_db;
OK
Time taken: 0.024 seconds

hive> use bdp_db;
OK
Time taken: 0.018 seconds

Step 3: Execution on spark-shell

In this step, we will import JSON in hive using spark SQL. First, have to start the spark command line. Here I am using the pyspark command to start.

 Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1

      /_/

Using Python version 2.7.12 (default, Jun 29 2016 11:08:50)

SparkContext available as sc, HiveContext available as sqlContext.
>>>

Now, I am having a pyspark shell. If you notice in the last line, it has already created Spark Context as sc and Hive Context as sqlContext.

>>> import json
>>> from pyspark.sql import HiveContext
>>> hiveContext = HiveContext(sc)
>>> jsonDF = hiveContext.read.json('file:///home/bdp/My_Work_Book/Spark/jsondata.json')

Here, I have imported JSON library to parse JSON file. I am creating HiveContext from the SparkContext. In the last line, we are loading the JSON file. The JSON file path is the local path where the JSON file exists. This will return a data frame.

To check the schema of the data frame:

 >>> jsonDF.printSchema()

There is a function called “show”. It will show data frame records. In addition, it takes an optional argument. This argument is nothing but to limit the records to show.

>>> jsonDF.show(3)

Now, we are done with the first task. The second task is to save the data frame in hive table with “ORC” formatted.

>>> jsonDF.write.format("orc").saveAsTable("bdp_db.jsonTest")

INFO HiveContext$$anon$2: Persisting data source relation `bdp_db`.`jsonTest` with a single input path into Hive metastore in Hive compatible format. Input path: hdfs://user/bdp/db/jsontest.

In the above command, using format to specify the format of the storage and saveAsTable to save the data frame as a hive table. That’s it. It will store the data frame into hive database bdp_db with the table name “jsonTest”.

Step 4: Verify data in Hive

Once done with step 3. Let’s verify the hive table in database bdp_db. I am already under bdp_db database. So directly checking the table.

 hive> show tables;
OK
jsontest
Time taken: 0.111 seconds, Fetched: 1 row(s)

To describe the table use the below query:

 hive> desc formatted jsontest;
OK
# col_name data_type comment
country string
designation string
email string
first_name string
gender string
id bigint
last_name string
phone string
# Detailed Table Information
Database: bdp_db
Owner: bdp
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://user/bdp/db/jsontest
Table Type: MANAGED_TABLE
Table Parameters:
 COLUMN_STATS_ACCURATE false
 EXTERNAL FALSE
 numFiles 1
 numRows -1
 rawDataSize -1
 spark.sql.sources.provider orc
 spark.sql.sources.schema.numParts 1
 spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"country\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"designation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"email\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"first_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"gender\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"phone\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
 totalSize 28195
 transient_lastDdlTime 1489597485
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
 path hdfs://user/xdishaw/db/jsontest
 serialization.format 1
Time taken: 0.146 seconds, Fetched: 43 row(s)

Check data in hive table:

 --#Check data
hive> select * from jsontest limit 3;

Wrapping Up

Here we directly loaded JSON data into a Spark data frame. Once the data is available in the data frame, we can process it with transformation and action. The saveAsTable function is used to save the content of the data frame into the hive table.

2 Comments

Is it needed to create output table first. In my case i am facing issue like can’t read table ,table do not have sequence file..pls help me on this.
Sruthi
April 28, 2017
Log in to Reply
Load JSON Data into Hive Partitioned table using Spark - BIG DATA PROGRAMMERS
May 6, 2017

Load JSON Data in Hive non-partitioned table using Spark