Load JSON Data in Hive non-partitioned table using Spark

Load JSON Data in Hive non-partitioned table using Spark

Requirement

Suppose there is a source data which is in JSON format. The requirement is to load JSON data in Hive non-partitioned table using Spark.

Let’s break the requirement into two task:

  • Load JSON data in spark data frame and read it
  • Store it in a hive non-partition table

Components Involved

To achieve the requirement, below components will be used:

  • Hive – It is used to store data in non-partitioned and with ORC file format.
  • Spark SQL – It is used to load the JSON data, process and store into the hive.

Here hive table will be a non-partitioned table and will store the data in an ORC format.

Solution

Step 1: JSON sample data

The sample of JSON formatted data:

json sample data
 
  1. [{"id":1,"first_name":"Steven","last_name":"Garrett","email":"sgarrett0@amazon.co.uk","gender":"Male","designation":"Design Engineer","phone":"62-(454)694-4544","country":"Indonesia"},{"id":2,"first_name":"Brandon","last_name":"Green","email":"bgreen1@fc2.com","gender":"Male","designation":"Graphic Designer","phone":"7-(587)965-1714","country":"Russia"},{"id":3,"first_name":"Martha","last_name":"Bennett","email":"mbennett2@abc.net.au","gender":"Female","designation":"Recruiting Manager","phone":"48-(582)234-3809","country":"Poland"},{"id":4,"first_name":"Samuel","last_name":"Lopez","email":"slopez3@usgs.gov","gender":"Male","designation":"Automation Specialist I","phone":"33-(654)376-1795","country":"France"},{"id":5,"first_name":"Judy","last_name":"Bishop","email":"jbishop4@163.com","gender":"Female","designation":"Librarian","phone":"81-(509)760-1241","country":"Japan"},{"id":6,"first_name":"Anna","last_name":"Morales","email":"amorales5@eventbrite.com","gender":"Female","designation":"Assistant Professor","phone":"33-(675)922-1030","country":"France"},{"id":7,"first_name":"Benjamin","last_name":"Walker","email":"bwalker6@furl.net","gender":"Male","designation":"Computer Systems Analyst II","phone":"86-(249)310-6467","country":"China"},{"id":8,"first_name":"Sean","last_name":"Perkins","email":"sperkins7@usatoday.com","gender":"Male","designation":"Nurse","phone":"1-(504)398-8997","country":"Canada"}]

Download sample data from below link:

sample json data

Once you download the file, keep it in local path. In my case, the file path is “home/bdp/My_Work_Book/Spark/jsondata.json”.

Step 2: Verify hive database

First, open the hive CLI and verify the database. The hive table will create under this database. If database available, then select the database using “use” command.

hive db
 
  1. hive> USE bdp_db;
  2. OK
  3. TIME taken: 0.024 seconds
  4. hive> SHOW TABLES;
  5. OK
  6. TIME taken: 0.274 seconds

Here, I have used directly “USE” command of the hive to select database. In case, the database does not exist, it will give an error. So to create a database you can use CREATE DATABASE <database_name>. “Show Tables” command will give a list of existing tables under selected Database.

Step 3: Execution on spark-shell

In this step, we will import JSON in hive using spark SQL. First, have to start spark command line. Here I am using pyspark command to start.

load json data
 
  1. Welcome to
  2.       ____              __
  3.      / __/__  ___ _____/ /__
  4.     _\ \/ _ \/ _ `/ __/  '_/
  5.    /__ / .__/\_,_/_/ /_/\_  version 1.6.1
  6.       /_/
  7. Using Python version 2.7.12 (default, Jun 29 2016 11:08:50)
  8. SparkContext available as sc, HiveContext available as sqlContext.
  9. >>>

Now, I am having pyspark shell. If you notice in the last line, it has already created Spark Context as sc and Hive Context as sqlContext.

 
 
  1. >>> import json
  2. >>> from pyspark.sql import HiveContext
  3. >>> hiveContext = HiveContext(sc)
  4. >>> jsonDF = hiveContext.read.json('file:///home/bdp/My_Work_Book/Spark/jsondata.json')

Here, I have imported JSON library to parse JSON file. I am creating HiveContext from the SparkContext.  In the last line, we are loading the JSON file. The JSON file path is the local path where JSON file exists. This will return a data frame.

To check the schema of the data frame:

dataframe schema
 
  1. >>> jsonDF.printSchema()

 There is a function called “show”. It will show data frame records. In addition, it takes an optional argument. This argument is nothing but to limit the records to show.

show data
 
  1. >>> jsonDF.show(3)

Now, we are done with the first task. The second task is to save the data frame in hive table with “” formatted.

save dataframe in hive
 
  1. >>> jsonDF.write.format("orc").saveAsTable("bdp_db.jsonTest")
  2. INFO HiveContext$$anon$2: Persisting data source relation `bdp_db`.`jsonTest` with a single input path into Hive metastore in Hive compatible format. Input path: hdfs://user/bdp/db/jsontest.

In the above command, using format to specify the format of the storage and saveAsTable to save the data frame as a hive table. That’s it. It will store the data frame date into hive database bdp_db with the table name “jsonTest”.

Step 4: Verify data in Hive

Once done with step 3. Let’s verify the hive table in database bdp_db. I am already under bdp_db database. So directly checking the table.

hive query
 
  1. hive> show tables;
  2. OK
  3. jsontest
  4. Time taken: 0.111 seconds, Fetched: 1 row(s)

To describe the table use below query:

 
 
  1. hive> DESC formatted jsontest;
  2. OK
  3. # col_name data_type comment
  4. country string
  5. designation string
  6. email string
  7. first_name string
  8. gender string
  9. id BIGINT
  10. last_name string
  11. phone string
  12. # Detailed TABLE Information
  13. DATABASE: bdp_db
  14. Owner: bdp
  15. LastAccessTime: UNKNOWN
  16. Protect Mode: NONE
  17. Retention: 0
  18. Location: hdfs://USER/bdp/db/jsontest
  19. TABLE TYPE: MANAGED_TABLE
  20. TABLE Parameters:
  21.  COLUMN_STATS_ACCURATE FALSE
  22.  EXTERNAL FALSE
  23.  numFiles 1
  24.  numRows -1
  25.  rawDataSize -1
  26.  spark.SQL.sources.provider orc
  27.  spark.SQL.sources.schema.numParts 1
  28.  spark.SQL.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"country\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"designation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"email\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"first_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"gender\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"phone\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}
  29.  totalSize 28195
  30.  transient_lastDdlTime 1489597485
  31. # Storage Information
  32. SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
  33. InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
  34. OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
  35. Compressed: No
  36. Num Buckets: -1
  37. Bucket Columns: []
  38. Sort Columns: []
  39. Storage Desc Params:
  40.  path hdfs://user/xdishaw/db/jsontest
  41.  serialization.format 1
  42. Time taken: 0.146 seconds, Fetched: 43 row(s)

Check data in hive table.

select
 
  1. --#Check data
  2. hive> SELECT * FROM jsontest LIMIT 3;

Wrapping Up

Here we directly loaded JSON data into a Spark data frame. Once the data is available in the data frame, we can process it with transformation and action. The saveAsTable function gets used to save the content of the data frame in a table.

40

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

2 Comments

  1. Is it needed to create output table first. In my case i am facing issue like can’t read table ,table do not have sequence file..pls help me on this.

    1

    0

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.