Load JSON Data in Hive non-partitioned table using Spark

Requirement

Suppose we have source data in JSON format. The requirement is to load this JSON data into a Hive non-partitioned table using Spark.

Let’s break the requirement into two tasks:

  • Load the JSON data into a Spark data frame and read it
  • Store it in a Hive non-partitioned table

Components Involved

To achieve the requirement, the following components will be used:

  • Hive – used to store the data in a non-partitioned table with the ORC file format.
  • Spark SQL – used to load the JSON data, process it, and store it in Hive.

Here the Hive table will be non-partitioned and will store the data in ORC format.

Solution

Step 1: JSON sample data

A sample of the JSON-formatted data:

  [{"id":1,"first_name":"Steven","last_name":"Garrett","email":"sgarrett0@amazon.co.uk","gender":"Male","designation":"Design Engineer","phone":"62-(454)694-4544","country":"Indonesia"},{"id":2,"first_name":"Brandon","last_name":"Green","email":"bgreen1@fc2.com","gender":"Male","designation":"Graphic Designer","phone":"7-(587)965-1714","country":"Russia"},{"id":3,"first_name":"Martha","last_name":"Bennett","email":"mbennett2@abc.net.au","gender":"Female","designation":"Recruiting Manager","phone":"48-(582)234-3809","country":"Poland"},{"id":4,"first_name":"Samuel","last_name":"Lopez","email":"slopez3@usgs.gov","gender":"Male","designation":"Automation Specialist I","phone":"33-(654)376-1795","country":"France"},{"id":5,"first_name":"Judy","last_name":"Bishop","email":"jbishop4@163.com","gender":"Female","designation":"Librarian","phone":"81-(509)760-1241","country":"Japan"},{"id":6,"first_name":"Anna","last_name":"Morales","email":"amorales5@eventbrite.com","gender":"Female","designation":"Assistant Professor","phone":"33-(675)922-1030","country":"France"},{"id":7,"first_name":"Benjamin","last_name":"Walker","email":"bwalker6@furl.net","gender":"Male","designation":"Computer Systems Analyst II","phone":"86-(249)310-6467","country":"China"},{"id":8,"first_name":"Sean","last_name":"Perkins","email":"sperkins7@usatoday.com","gender":"Male","designation":"Nurse","phone":"1-(504)398-8997","country":"Canada"}]

Download the sample data from the below link:

sample json data

Once you download the file, keep it at a local path. In my case, the file path is “/home/bdp/My_Work_Book/Spark/jsondata.json”.

Step 2: Verify hive database

First, open the Hive CLI and verify the database. The Hive table will be created under this database. If the database is available, select it using the “USE” command.

  hive> USE bdp_db;
  OK
  Time taken: 0.024 seconds
  hive> SHOW TABLES;
  OK
  Time taken: 0.274 seconds

Here, I have directly used Hive’s “USE” command to select the database. If the database does not exist, this will give an error. To create the database, you can use CREATE DATABASE <database_name>, as shown below. The “SHOW TABLES” command lists the existing tables under the selected database.
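
For example, a minimal, idempotent way to create the database from the Hive CLI (IF NOT EXISTS makes the command safe to re-run):

  hive> CREATE DATABASE IF NOT EXISTS bdp_db;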

Step 3: Execution in the Spark shell

In this step, we will import the JSON into Hive using Spark SQL. First, we have to start the Spark command line. Here I am using the pyspark command to start, as shown below.
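
Assuming the pyspark binary is on your PATH (the exact invocation depends on your installation), start it from a terminal:

  $ pyspark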

  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
        /_/

  Using Python version 2.7.12 (default, Jun 29 2016 11:08:50)
  SparkContext available as sc, HiveContext available as sqlContext.
  >>>

Now I have the pyspark shell. Notice in the last line that the Spark context is already available as sc and the Hive context as sqlContext.

 
 
  >>> import json
  >>> from pyspark.sql import HiveContext
  >>> hiveContext = HiveContext(sc)
  >>> jsonDF = hiveContext.read.json('file:///home/bdp/My_Work_Book/Spark/jsondata.json')

Here, I have imported the JSON library (not strictly needed, since Spark parses the file itself) and created a HiveContext from the SparkContext. In the last line, we load the JSON file; the path is the local path where the file exists, hence the file:// prefix. Spark’s JSON reader expects each line of the file to be a self-contained JSON document; our sample keeps the whole array on one line, so Spark turns each array element into a row. The call returns a data frame.

To check the schema of the data frame:

  >>> jsonDF.printSchema()
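
Given the sample data, the output should look like the following (Spark orders the inferred JSON fields alphabetically and infers id as long):

  root
   |-- country: string (nullable = true)
   |-- designation: string (nullable = true)
   |-- email: string (nullable = true)
   |-- first_name: string (nullable = true)
   |-- gender: string (nullable = true)
   |-- id: long (nullable = true)
   |-- last_name: string (nullable = true)
   |-- phone: string (nullable = true)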

There is a function called “show” that displays the data frame records. It takes an optional argument that limits the number of records displayed.

  >>> jsonDF.show(3)
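
With the sample file, the output will look roughly like this (Spark truncates long string values by default):

  +---------+------------------+--------------------+----------+------+--+---------+----------------+
  |  country|       designation|               email|first_name|gender|id|last_name|           phone|
  +---------+------------------+--------------------+----------+------+--+---------+----------------+
  |Indonesia|   Design Engineer|sgarrett0@amazon....|    Steven|  Male| 1|  Garrett|62-(454)694-4544|
  |   Russia|  Graphic Designer|     bgreen1@fc2.com|   Brandon|  Male| 2|    Green| 7-(587)965-1714|
  |   Poland|Recruiting Manager|mbennett2@abc.net.au|    Martha|Female| 3|  Bennett|48-(582)234-3809|
  +---------+------------------+--------------------+----------+------+--+---------+----------------+
  only showing top 3 rows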

Now we are done with the first task. The second task is to save the data frame to a Hive table in ORC format.

  >>> jsonDF.write.format("orc").saveAsTable("bdp_db.jsonTest")
  INFO HiveContext$$anon$2: Persisting data source relation `bdp_db`.`jsonTest` with a single input path into Hive metastore in Hive compatible format. Input path: hdfs://user/bdp/db/jsontest.

In the above command, format specifies the storage format and saveAsTable saves the data frame as a Hive table. That’s it. It will store the data frame’s data in the Hive database bdp_db under the table name “jsonTest”.
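
Note that saveAsTable fails if the table already exists. If re-running the job should replace the table, an explicit save mode can be set (a minimal sketch; pick the mode that fits your pipeline):

  >>> jsonDF.write.format("orc").mode("overwrite").saveAsTable("bdp_db.jsonTest")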

Step 4: Verify data in Hive

Once done with step 3, let’s verify the Hive table in the database bdp_db. I am already in the bdp_db database, so I check the table directly.

  hive> SHOW TABLES;
  OK
  jsontest
  Time taken: 0.111 seconds, Fetched: 1 row(s)

To describe the table, use the below query:

 
 
  hive> DESC FORMATTED jsontest;
  OK
  # col_name              data_type       comment
  country                 string
  designation             string
  email                   string
  first_name              string
  gender                  string
  id                      bigint
  last_name               string
  phone                   string

  # Detailed Table Information
  Database:               bdp_db
  Owner:                  bdp
  LastAccessTime:         UNKNOWN
  Protect Mode:           None
  Retention:              0
  Location:               hdfs://user/bdp/db/jsontest
  Table Type:             MANAGED_TABLE
  Table Parameters:
          COLUMN_STATS_ACCURATE   FALSE
          EXTERNAL                FALSE
          numFiles                1
          numRows                 -1
          rawDataSize             -1
          spark.sql.sources.provider      orc
          spark.sql.sources.schema.numParts       1
          spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"country\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"designation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"email\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"first_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"gender\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"phone\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
          totalSize               28195
          transient_lastDdlTime   1489597485

  # Storage Information
  SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
  InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
  OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
  Compressed:             No
  Num Buckets:            -1
  Bucket Columns:         []
  Sort Columns:           []
  Storage Desc Params:
          path                    hdfs://user/bdp/db/jsontest
          serialization.format    1
  Time taken: 0.146 seconds, Fetched: 43 row(s)

Check the data in the Hive table:

  -- Check data
  hive> SELECT * FROM jsontest LIMIT 3;
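
Given the sample data, the first three rows should come back as below (Hive prints tab-separated values in the table’s column order; the timing line varies, so it is omitted here):

  Indonesia    Design Engineer    sgarrett0@amazon.co.uk    Steven    Male    1    Garrett    62-(454)694-4544
  Russia       Graphic Designer   bgreen1@fc2.com           Brandon   Male    2    Green      7-(587)965-1714
  Poland       Recruiting Manager mbennett2@abc.net.au      Martha    Female  3    Bennett    48-(582)234-3809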

Wrapping Up

Here we directly loaded JSON data into a Spark data frame. Once the data is available in the data frame, we can process it with transformations and actions. The saveAsTable function is used to save the content of the data frame to a Hive table.
