How to Read a JSON File in Spark

Requirement

Let’s say we have data in JSON format. The file may contain the data either in a single line or across multiple lines. The requirement is to process this data using a Spark data frame.

In addition, we will also see how to compare two data frames, along with a few other transformations.

Sample Data

JSON data in a single line:

 
 
  1. [{"empno":"7369","ename":"SMITH","designation":"CLERK","manager":"7902","hire_date":"12/17/1980","sal":"800","deptno":"20" },{"empno":"7499","ename":"ALLEN","designation":"SALESMAN","manager":"7698","hire_date":"2/20/1981","sal":"1600","deptno":"30"},{"empno":"7521","ename":"WARD","designation":"SALESMAN","manager":"7698","hire_date":"2/22/1981","sal":"1250","deptno":"30"  },{"empno":"7566","ename":"TURNER","designation":"MANAGER","manager":"7839","hire_date":"4/2/1981","sal":"2975","deptno":"20"},{"empno":"7654","ename":"MARTIN","designation":"SALESMAN","manager":"7698","hire_date":"9/28/1981","sal":"1250","deptno":"30"},{"empno":"7698","ename":"MILLER","designation":"MANAGER","manager":"7839","hire_date":"5/1/1981","sal":"2850","deptno":"30"},{"empno":"7782","ename":"CLARK","designation":"MANAGER","manager":"7839","hire_date":"6/9/1981","sal":"2450","deptno":"10"},{"empno":"7788","ename":"SCOTT","designation":"ANALYST","manager":"7566","hire_date":"12/9/1982","sal":"3000","deptno":"20"},{"empno":"7839","ename":"KING","designation":"PRESIDENT","manager":"NULL","hire_date":"11/17/1981","sal":"5000","deptno":"10"}]

JSON data in multiple lines:

 
 
[
  {
    "empno":"7369",
    "ename":"SMITH",
    "designation":"CLERK",
    "manager":"7902",
    "hire_date":"12/17/1980",
    "sal":"800",
    "deptno":"20"
  },
  {
    "empno":"7499",
    "ename":"ALLEN",
    "designation":"SALESMAN",
    "manager":"7698",
    "hire_date":"2/20/1981",
    "sal":"1600",
    "deptno":"30"
  },
  {
    "empno":"7521",
    "ename":"WARD",
    "designation":"SALESMAN",
    "manager":"7698",
    "hire_date":"2/22/1981",
    "sal":"1250",
    "deptno":"30"
  },
  {
    "empno":"7566",
    "ename":"TURNER",
    "designation":"MANAGER",
    "manager":"7839",
    "hire_date":"4/2/1981",
    "sal":"2975",
    "deptno":"20"
  },
  {
    "empno":"7654",
    "ename":"MARTIN",
    "designation":"SALESMAN",
    "manager":"7698",
    "hire_date":"9/28/1981",
    "sal":"1250",
    "deptno":"30"
  },
  {
    "empno":"7698",
    "ename":"MILLER",
    "designation":"MANAGER",
    "manager":"7839",
    "hire_date":"5/1/1981",
    "sal":"2850",
    "deptno":"30"
  },
  {
    "empno":"7782",
    "ename":"CLARK",
    "designation":"MANAGER",
    "manager":"7839",
    "hire_date":"6/9/1981",
    "sal":"2450",
    "deptno":"10"
  },
  {
    "empno":"7788",
    "ename":"SCOTT",
    "designation":"ANALYST",
    "manager":"7566",
    "hire_date":"12/9/1982",
    "sal":"3000",
    "deptno":"20"
  },
  {
    "empno":"7839",
    "ename":"KING",
    "designation":"PRESIDENT",
    "manager":"NULL",
    "hire_date":"11/17/1981",
    "sal":"5000",
    "deptno":"10"
  }
]


Solution

Step 1: Setup

We will use the sample data given above in the code. Save the two snippets as files at any location. In my case, I have kept these files at ‘/home/bdp/data/employees_singleLine.json’ and ‘/home/bdp/data/employees_multiLine.json’.
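All the commands in this post are run inside Spark’s interactive Scala shell. Assuming Spark is installed and spark-shell is available on your PATH (an assumption about your environment), launch it from a terminal:

$ spark-shell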

Step 2: Write Code and Execute

Once the spark-shell is open, you can load the JSON data using the below command:

 
 
// Load JSON data:
scala> val jsonData_1 = sqlContext.read.json("file:///home/bdp/data/employees_singleLine.json")
// Check the schema
scala> jsonData_1.printSchema()

Here, we have loaded the JSON file available at the local path. The JSON reader expects each line of the input to contain a complete JSON document, which is why the single-line file can be read directly. Now, load the other JSON file, whose records span multiple lines; to handle this, we first read the entire file as a single value using wholeTextFiles and then pass it to the JSON reader.

 
 
scala> val jsonData_2 = sqlContext.read.json(sc.wholeTextFiles("file:///home/bdp/data/employees_multiLine.json").values)
scala> jsonData_2.printSchema

Compare both data frames:

 
 
scala> jsonData_1.except(jsonData_2).show

Here, the except function is used to compare both data frames. It returns the rows of jsonData_1 that are not present in jsonData_2, so an empty result means the first data frame has no rows missing from the second.
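Note that except is one-directional. As a minimal sketch (the diff1, diff2, and isSame names are just illustrative variables, not part of any API), you could check both directions to confirm that the two data frames hold exactly the same rows:

// Rows in jsonData_1 but not in jsonData_2, and vice versa
scala> val diff1 = jsonData_1.except(jsonData_2)
scala> val diff2 = jsonData_2.except(jsonData_1)
// Both counts being zero means the data frames contain the same rows
scala> val isSame = diff1.count == 0 && diff2.count == 0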

Check the contents of the data frame:

 
 
// Check Data
scala> jsonData_1.show()

// Get ename
scala> jsonData_1.select("ename").show

// Get Distinct deptno
scala> jsonData_1.select("deptno").distinct.show

Alternatively, we can create a temp table and execute the query on the table:

 
 
// Register a table
scala> jsonData_1.registerTempTable("employeeTbl")
// Get Distinct deptno using query on table
scala> sqlContext.sql("select distinct deptno from employeeTbl").show

Note: By default, Spark SQL resolves table names case-insensitively, but this behavior is controlled by the spark.sql.caseSensitive setting. To be safe, use the table name in the query exactly as it was registered.
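Once the table is registered, any Spark SQL query can be run against it. As one more illustrative query (an addition to the original steps), here is an aggregation per department, again casting sal because the JSON values were read as strings:

// Average salary per department
scala> sqlContext.sql("select deptno, avg(cast(sal as int)) as avg_sal from employeeTbl group by deptno").show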

All the commands used for the processing:

 
 
// Load JSON data:
scala> val jsonData_1 = sqlContext.read.json("file:///home/bdp/data/employees_singleLine.json")
scala> val jsonData_2 = sqlContext.read.json(sc.wholeTextFiles("file:///home/bdp/data/employees_multiLine.json").values)
// Check the schema
scala> jsonData_1.printSchema()
scala> jsonData_2.printSchema()
// Compare the data frames
scala> jsonData_1.except(jsonData_2).show
// Check Data
scala> jsonData_1.show()
// Get ename
scala> jsonData_1.select("ename").show
// Get Distinct deptno
scala> jsonData_1.select("deptno").distinct.show
// Register a table
scala> jsonData_1.registerTempTable("employeeTbl")
// Get Distinct deptno using query on table
scala> sqlContext.sql("select distinct deptno from employeeTbl").show


In Spark 2.x:

From Spark 2.2 onward, the JSON reader has a built-in multiLine option, so the multi-line file can be loaded directly using the below command:

 
 
scala> spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("file:///home/bdp/data/employees_multiLine.json").show
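For reference (these Spark 2.x equivalents are an addition to the original write-up), the SparkSession entry point spark replaces sqlContext, and createOrReplaceTempView supersedes the deprecated registerTempTable:

// Single-line (line-delimited) JSON loads directly in Spark 2.x as well
scala> val jsonData_1 = spark.read.json("file:///home/bdp/data/employees_singleLine.json")
// Register a temporary view and query it
scala> jsonData_1.createOrReplaceTempView("employeeTbl")
scala> spark.sql("select distinct deptno from employeeTbl").show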

Wrapping Up

In this post, we have gone through how to parse JSON data that arrives either in a single line or across multiple lines. We have also seen how to fetch a specific column from the data frame directly, as well as by creating a temp table. The except function was used to compare two data frames and check whether they hold the same data. You can explore further by applying more transformations and actions to the created data frames.
