Requirement
Suppose there is a source data which is in JSON format. The requirement is to load JSON data into Hive non-partitioned table using Spark.
Let’s break the requirement into two tasks:
- Load JSON data in spark data frame and read it
- Store into hive non-partition table
Components Involved
To achieve the requirement, below components will be used:
- Hive – It is used to store data in a non-partitioned table with ORC file format.
- Spark SQL – It is used to load the JSON data, process and store into the hive.
Here, the hive table will be a non-partitioned table and will store the data in ORC format.
Solution
Step 1: JSON sample data
The sample of JSON formatted data:
[{"id":1,"first_name":"Steven","last_name":"Garrett","email":"sgarrett0@amazon.co.uk","gender":"Male","designation":"Design Engineer","phone":"62-(454)694-4544","country":"Indonesia"},{"id":2,"first_name":"Brandon","last_name":"Green","email":"bgreen1@fc2.com","gender":"Male","designation":"Graphic Designer","phone":"7-(587)965-1714","country":"Russia"},{"id":3,"first_name":"Martha","last_name":"Bennett","email":"mbennett2@abc.net.au","gender":"Female","designation":"Recruiting Manager","phone":"48-(582)234-3809","country":"Poland"},{"id":4,"first_name":"Samuel","last_name":"Lopez","email":"slopez3@usgs.gov","gender":"Male","designation":"Automation Specialist I","phone":"33-(654)376-1795","country":"France"},{"id":5,"first_name":"Judy","last_name":"Bishop","email":"jbishop4@163.com","gender":"Female","designation":"Librarian","phone":"81-(509)760-1241","country":"Japan"},{"id":6,"first_name":"Anna","last_name":"Morales","email":"amorales5@eventbrite.com","gender":"Female","designation":"Assistant Professor","phone":"33-(675)922-1030","country":"France"},{"id":7,"first_name":"Benjamin","last_name":"Walker","email":"bwalker6@furl.net","gender":"Male","designation":"Computer Systems Analyst II","phone":"86-(249)310-6467","country":"China"},{"id":8,"first_name":"Sean","last_name":"Perkins","email":"sperkins7@usatoday.com","gender":"Male","designation":"Nurse","phone":"1-(504)398-8997","country":"Canada"}]
Download sample data from the below link:
sample json dataOnce you download the file, keep it in the local path. In my case, the file path is “home/bdp/My_Work_Book/Spark/jsondata.json”.
Step 2: Create hive database
First, open the hive CLI and create a database using below command. If the database is available, then select the database using “use” command.
hive> create database bdp_db; OK Time taken: 0.024 seconds hive> use bdp_db; OK Time taken: 0.018 seconds
Step 3: Execution on spark-shell
In this step, we will import JSON in hive using spark SQL. First, have to start the spark command line. Here I am using the pyspark command to start.
Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.6.1 /_/ Using Python version 2.7.12 (default, Jun 29 2016 11:08:50) SparkContext available as sc, HiveContext available as sqlContext. >>>
Now, I am having a pyspark shell. If you notice in the last line, it has already created Spark Context as sc and Hive Context as sqlContext.
>>> import json >>> from pyspark.sql import HiveContext >>> hiveContext = HiveContext(sc) >>> jsonDF = hiveContext.read.json('file:///home/bdp/My_Work_Book/Spark/jsondata.json')
Here, I have imported JSON library to parse JSON file. I am creating HiveContext from the SparkContext. In the last line, we are loading the JSON file. The JSON file path is the local path where the JSON file exists. This will return a data frame.
To check the schema of the data frame:
>>> jsonDF.printSchema()
There is a function called “show”. It will show data frame records. In addition, it takes an optional argument. This argument is nothing but to limit the records to show.
>>> jsonDF.show(3)
Now, we are done with the first task. The second task is to save the data frame in hive table with “ORC” formatted.
>>> jsonDF.write.format("orc").saveAsTable("bdp_db.jsonTest") INFO HiveContext$$anon$2: Persisting data source relation `bdp_db`.`jsonTest` with a single input path into Hive metastore in Hive compatible format. Input path: hdfs://user/bdp/db/jsontest.
In the above command, using format to specify the format of the storage and saveAsTable to save the data frame as a hive table. That’s it. It will store the data frame into hive database bdp_db with the table name “jsonTest”.
Step 4: Verify data in Hive
Once done with step 3. Let’s verify the hive table in database bdp_db. I am already under bdp_db database. So directly checking the table.
hive> show tables; OK jsontest Time taken: 0.111 seconds, Fetched: 1 row(s)
To describe the table use the below query:
hive> desc formatted jsontest; OK # col_name data_type comment country string designation string email string first_name string gender string id bigint last_name string phone string # Detailed Table Information Database: bdp_db Owner: bdp LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://user/bdp/db/jsontest Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE false EXTERNAL FALSE numFiles 1 numRows -1 rawDataSize -1 spark.sql.sources.provider orc spark.sql.sources.schema.numParts 1 spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"country\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"designation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"email\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"first_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"gender\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"phone\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]} totalSize 28195 transient_lastDdlTime 1489597485 # Storage Information SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: path hdfs://user/xdishaw/db/jsontest serialization.format 1 Time taken: 0.146 seconds, Fetched: 43 row(s)
Check data in hive table:
--#Check data hive> select * from jsontest limit 3;
Wrapping Up
Here we directly loaded JSON data into a Spark data frame. Once the data is available in the data frame, we can process it with transformation and action. The saveAsTable function is used to save the content of the data frame into the hive table.
Is it needed to create output table first. In my case i am facing issue like can’t read table ,table do not have sequence file..pls help me on this.