Parse XML data in Hive

Requirement

Suppose you have an XML-formatted data file, and this source file contains some empty tags. The requirement is to parse the XML data in Hive and read it while correctly handling the tags that are empty in the source data.

Components Involved

  • Hive
  • Maven
  • Java

Solution

There are several ways to parse XML data into a Hive table. One of them uses SerDe properties in the ROW FORMAT clause after adding the hivexmlserde jar file. Another, which this article follows, is to store the XML file as a single string in a temporary Hive table and then fetch the value of each tag using XPath.
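For reference, the first alternative looks roughly like the sketch below. It is only an illustration and is not used in the rest of this article; it assumes the com.ibm.spss.hive.serde2.xml.XmlSerDe class and its companion XmlInputFormat from the hivexmlserde jar, and the jar path and table layout shown are placeholders:

```sql
ADD JAR /path/to/hivexmlserde.jar;

CREATE TABLE employee_xml (name STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  -- map the "name" column to an XPath inside each record
  "column.xpath.name" = "/Employee/Name/text()"
)
STORED AS
  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  -- each <Employee>...</Employee> block becomes one row
  "xmlinput.start" = "<Employee",
  "xmlinput.end"   = "</Employee>"
);
```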

Download the sample XML data file and keep it at a local location in your system.
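The original download link is not reproduced here, so the file below is a reconstruction based on the XPath expressions and the query output shown later in this article (note the empty <Visa/> tags for both employees and the missing phone number for the second one). It is pretty-printed for readability, but the content should sit on a single line in the stored file so the temp table holds exactly one record:

```xml
<Company>
  <Employee>
    <Id>458790</Id>
    <Name>Sameer</Name>
    <Email>sameer@email.com</Email>
    <Address>
      <HouseNo>105</HouseNo>
      <Street>Grand Road</Street>
      <City>Bangalore</City>
      <State>Karnataka</State>
      <Pincode>560068</Pincode>
      <Country>India</Country>
    </Address>
    <Passport>Available</Passport>
    <Visa/>
    <Contact>
      <Mobile>9909999999</Mobile>
      <Phone>8044552266</Phone>
    </Contact>
  </Employee>
  <Employee>
    <Id>458791</Id>
    <Name>Gohar</Name>
    <Email>Gohar@email.com</Email>
    <Address>
      <HouseNo>485</HouseNo>
      <Street>Camac Street Road</Street>
      <City>Mumbai</City>
      <State>Maharastra</State>
      <Pincode>400001</Pincode>
      <Country>India</Country>
    </Address>
    <Passport>Available</Passport>
    <Visa/>
    <Contact>
      <Mobile>9908888888</Mobile>
      <Phone/>
    </Contact>
  </Employee>
</Company>
```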

Step 1: Create a temp hive table

  CREATE EXTERNAL TABLE companyxml(xmldata STRING)
  LOCATION '/user/hive/companyxml/company.xml';

In this step, we have created a temp table that stores the whole XML document as a single record. The LOCATION clause points to the path of the XML file. Since the default text format treats each line as one record, the XML content must sit on a single line for the table to contain exactly one record.

Step 2:  Create View

  DROP VIEW companyview;
  CREATE VIEW companyview
  (id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
  AS SELECT
  xpath(xmldata,'Company/Employee/Id/text()'),
  xpath(xmldata,'Company/Employee/Name/text()'),
  xpath(xmldata,'Company/Employee/Email/text()'),
  xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),
  xpath(xmldata,'Company/Employee/Address/Street/text()'),
  xpath(xmldata,'Company/Employee/Address/City/text()'),
  xpath(xmldata,'Company/Employee/Address/State/text()'),
  xpath(xmldata,'Company/Employee/Address/Pincode/text()'),
  xpath(xmldata,'Company/Employee/Address/Country/text()'),
  xpath(xmldata,'Company/Employee/Passport/text()'),
  xpath(xmldata,'Company/Employee/Visa/text()'),
  xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
  xpath(xmldata,'Company/Employee/Contact/Phone/text()')
  FROM companyxml;

Here, I am creating a view named companyview. This view parses each tag's value out of the temp table. To reach a tag and get its value from the XML, we can use XPath.

Output:

Let's check what the view looks like:

  SELECT * FROM companyview;
  ["458790","458791"]     ["Sameer","Gohar"]      ["sameer@email.com","Gohar@email.com"]  ["105","485"]   ["Grand Road","Camac Street Road"]      ["Bangalore","Mumbai"]  ["Karnataka","Maharastra"]      ["560068","400001"]     ["India","India"]       ["Available","Available"]       []      ["9909999999","9908888888"]     ["8044552266"]
  Time taken: 0.41 seconds, Fetched: 1 row(s)

In the above output, the phone column has only one value, as only one employee has a phone number, and the visa column is an empty array ([]). That means XPath has neglected all the empty tag values.

The xpath() function returns an array of matched values and skips all the empty tags, as shown above. To handle this problem, we need to write a custom Hive UDF.
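To see concretely why the empty tags disappear, here is a small standalone Java sketch using the JDK's javax.xml.xpath package (Hive's xpath() UDF follows the same XPath semantics): an empty element has no text node, so text() simply matches nothing for it.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathEmptyTagDemo {
    // Counts the text nodes an XPath expression matches in an XML string.
    public static int countTextNodes(String xml, String expr) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, new InputSource(new StringReader(xml)),
                              XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Company>"
                + "<Employee><Phone>8044552266</Phone></Employee>"
                + "<Employee><Phone/></Employee>"
                + "</Company>";
        // Two Phone elements, but the empty one contributes no text node.
        System.out.println(countTextNodes(xml, "Company/Employee/Phone/text()")); // prints 1
    }
}
```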

The logic behind the solution:

The problem arises when the XML data file has an empty tag. So we modify the XML data before passing it to XPath, supplying a placeholder value such as NULL for each empty tag. In this case, we will replace <Phone/> with <Phone>NULL</Phone>.
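The same replacement can also be sketched compactly with a regular expression. This is an illustrative alternative, not the article's UDF, and like the article's UDF it only handles attribute-free self-closing tags:

```java
public class EmptyTagRegexSketch {
    // Rewrite every attribute-free self-closing tag <Tag/> (optionally with
    // whitespace before the slash) as <Tag>NULL</Tag>.
    public static String fillEmptyTags(String xml) {
        return xml.replaceAll("<(\\w+)\\s*/>", "<$1>NULL</$1>");
    }

    public static void main(String[] args) {
        System.out.println(fillEmptyTags("<Contact><Phone/></Contact>"));
        // prints <Contact><Phone>NULL</Phone></Contact>
    }
}
```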

Create a Maven project in Eclipse, and in it create a Java file named XmlEmptyParse with the code below.

UDF Java Code:

  package hive.udf; // package must match the class name registered in Hive later

  import org.apache.hadoop.hive.ql.exec.UDF;

  public class XmlEmptyParse extends UDF {
      public String evaluate(String xmlData) {
          String replaceValue = "NULL";
          String xml = xmlData;
          // Rewrite every self-closing tag <Tag/> as <Tag>NULL</Tag> so that
          // XPath later finds a text node for it.
          while (xml.contains("/>")) {
              int index = xml.indexOf("/>");
              int firstIndex = xml.lastIndexOf("<", index);
              String tagName = xml.substring(firstIndex + 1, index);
              xml = xml.replace("<" + tagName + "/>",
                      "<" + tagName + ">" + replaceValue + "</" + tagName + ">");
          }
          return xml;
      }
  }

Once done with the Java code, export the Maven project as a jar file and keep it at a local location. I have exported it as XmlParseUdf-0.0.1-SNAPSHOT.jar. Now we are done with the UDF code; next, we will use this jar to register and call the UDF.
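For reference, the only compile-time dependency the UDF class needs is hive-exec, which provides org.apache.hadoop.hive.ql.exec.UDF. The version below is illustrative; match it to your cluster's Hive version:

```xml
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <!-- illustrative version; use your cluster's Hive version -->
  <version>1.2.1</version>
  <scope>provided</scope>
</dependency>
```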

Use UDF

Step 1: Add JAR

First, we need to add the jar to Hive using the below query:
ADD JAR [created_jar_location]
In my case, it is: ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;

  hive> ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;
  Added [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar] to class path
  Added resources: [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar]

Step 2: Create a Temporary function

  hive> CREATE TEMPORARY FUNCTION xmlUDF AS 'hive.udf.XmlEmptyParse';
  OK
  Time taken: 0.802 seconds

In order to use the UDF we just built, we must register it as a temporary function; this function name is then used directly in Hive queries. So here I have created a temporary function named xmlUDF, pointing at the fully qualified class name hive.udf.XmlEmptyParse.

Step 3: Use created function

So far, we have written the UDF and registered it as a temporary function. In this step, we will use the temporary function created in the previous step.

  hive> DROP VIEW companyview;
  OK
  Time taken: 2.793 seconds
  hive> CREATE VIEW companyview
  (id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
  AS SELECT
  xpath(xmldata,'Company/Employee/Id/text()'),
  xpath(xmldata,'Company/Employee/Name/text()'),
  xpath(xmldata,'Company/Employee/Email/text()'),
  xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),
  xpath(xmldata,'Company/Employee/Address/Street/text()'),
  xpath(xmldata,'Company/Employee/Address/City/text()'),
  xpath(xmldata,'Company/Employee/Address/State/text()'),
  xpath(xmldata,'Company/Employee/Address/Pincode/text()'),
  xpath(xmldata,'Company/Employee/Address/Country/text()'),
  xpath(xmldata,'Company/Employee/Passport/text()'),
  xpath(xmlUDF(xmldata),'Company/Employee/Visa/text()'),
  xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
  xpath(xmlUDF(xmldata),'Company/Employee/Contact/Phone/text()')
  FROM companyxml;
  OK
  Time taken: 0.727 seconds

Here, I have used the UDF for Visa and Phone, as these two columns had empty tags in the source data. Now let's check the data stored in the view.

Step 4: Validate output

 
 
  hive> SELECT * FROM companyview;
  OK
  ["458790","458791"]     ["Sameer","Gohar"]      ["sameer@email.com","Gohar@email.com"]  ["105","485"]   ["Grand Road","Camac Street Road"]      ["Bangalore","Mumbai"]  ["Karnataka","Maharastra"]      ["560068","400001"]     ["India","India"]       ["Available","Available"]       ["NULL","NULL"] ["9909999999","9908888888"]     ["8044552266","NULL"]
  Time taken: 1.184 seconds, Fetched: 1 row(s)

Wrapping Up

Here, you have seen how to handle an XML data file that contains one or more empty tags. With the UDF in place, the visa and phone columns now contain NULL placeholders (["NULL","NULL"] and ["8044552266","NULL"]), so every array stays aligned with the employee it belongs to.

Keep Learning 🙂
