Parse XML data in Hive

Requirement

Suppose you are having an XML formatted data file. This file contains some empty tag. The requirement is to parse XML data in Hive and assign any default value to the empty tags.

Components Involved

  • Hive
  • Maven
  • Java

Solution

There are many solutions for parsing XML data into hive table. One solution out of these uses SerDe properties in ROW FORMAT by adding a hivexmlserde jar file. Another solution is to store XML files as a single string data into a hive temp table and then fetch the data of each tag using XPATH.

Download below Sample XML data and keep it at the local location in your system:

sample

Step 1: Create a temp hive table

CREATE EXTERNAL TABLE companyxml(xmldata STRING) LOCATION '/user/hive/companyxml/company.xml';

In this step, we have created a temp table which stores  XML data as a single record. The location is the local path of the XML file.

Step 2:  Create View

CREATE VIEW companyview
(id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
AS SELECT
xpath(xmldata,'Company/Employee/Id/text()'),
xpath(xmldata,'Company/Employee/Name/text()'),
xpath(xmldata,'Company/Employee/Email/text()'),
xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),
xpath(xmldata,'Company/Employee/Address/Street/text()'),
xpath(xmldata,'Company/Employee/Address/City/text()'),
xpath(xmldata,'Company/Employee/Address/State/text()'),
xpath(xmldata,'Company/Employee/Address/Pincode/text()'),
xpath(xmldata,'Company/Employee/Address/Country/text()'),
xpath(xmldata,'Company/Employee/Passport/text()'),
xpath(xmldata,'Company/Employee/Visa/text()'),
xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
xpath(xmldata,'Company/Employee/Contact/Phone/text()')
FROM companyxml;

Here, I am creating a view say companyview.  This view is used to parse each tag value from the temp table.  In order to get the tag value of the XML, we can use XPath.

Step 3: Output

Let’s check how the view looks like:

 SELECT * FROM companyview;
["458790","458791"]     ["Sameer","Gohar"]      ["sameer@email.com","Gohar@email.com"]  ["105","485"]      ["Grand Road","Camac Street Road"]      ["Bangalore","Mumbai"]  ["Karnataka","Maharastra"] ["560068","400001"]     ["India","India"]       ["Available","Available"] []       ["9909999999","9908888888"]     ["8044552266"]
Time taken: 0.41 seconds, Fetched: 1 row(s)

In the above output, the tag phone has only one value as only one employee has a phone number and visa value is an empty array (shown by highlighted). That means it has empty tag values.

XPATH returns an Array list of data as shown above. In order to assign a default value to the empty tags, we need to write a custom hive UDF.

The logic behind the solution:

The problem is arising when your XML data file having an empty tag. So modify this XML data before passing to XPath and provide any value like blank, NULL for the empty tag. In this case, we will replace <phone/> by <phone>NULL</phone>.

Create a maven project in eclipse. And create a java file name XmlEmptyParse. Find the java code.

UDF Java Code:

import java.util.HashMap;
import java.util.Map; 
import org.apache.hadoop.hive.ql.exec.UDF; 
public class XmlEmptyParse extends UDF{ 
public String evaluate(String xmlData){ 
    String replaceValue = "NULL"; 
    Map<Integer, StringBuffer> xmlMap = new HashMap<Integer, StringBuffer>(); 
    xmlMap.put(1, new StringBuffer(xmlData)); 
    while (xmlMap.get(1).toString().contains("/>")) { 
        int index = xmlMap.get(1).toString().indexOf("/>"); 
        String sm = xmlMap.get(1).toString().substring(0, index); 
        int firstIndex = sm.lastIndexOf("<"); 
        String temp = xmlMap.get(1).toString().replace(xmlMap.get(1).toString().substring(firstIndex, index + 2), 
            "<" + xmlMap.get(1).toString().substring(firstIndex + 1, 
            index) + ">" + replaceValue + "</" + xmlMap.get(1).toString().substring(firstIndex + 1, index) + ">"); 
        xmlMap.put(1, new StringBuffer(temp)); 
        } 
        return xmlMap.get(1).toString(); 
    } 
}

Once done with the java code. Export the maven project as a jar file and keep it at a local location. I have exported this as an XmlParseUdf-0.0.1-SNAPSHOT.jar. Now, we are done with the UDF code. Next, we are going to use this jar to use the UDF function.

Step 4: Use UDF

a) Add JAR

First, need to add the jar to the hive using the below query:
ADD JAR [created_jar_location]
In my case, it is ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;

hive> ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;
Added [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar] to class path
Added resources: [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar]

b) Create a Temporary function

 hive> CREATE TEMPORARY FUNCTION xmlUDF as 'hive.udf.XmlEmptyParse';
OK 
Time taken: 0.802 seconds

Next, in order to use our custom UDF function, it is required to create a temporary function. This temporary function is used with the same name in the hive query.  So, here I have created a temporary function say xmlUDF.

c) Use created function

Finally, we have created UDF and also created a temporary function. In this step, we will use the temporary function which has been created in the previous step.

hive> CREATE VIEW companyview(id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
AS SELECT
xpath(xmldata,'Company/Employee/Id/text()'),
xpath(xmldata,'Company/Employee/Name/text()'),
xpath(xmldata,'Company/Employee/Email/text()'),
xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),xpath(xmldata,'Company/Employee/Address/Street/text()'),xpath(xmldata,'Company/Employee/Address/City/text()'),
xpath(xmldata,'Company/Employee/Address/State/text()'),
xpath(xmldata,'Company/Employee/Address/Pincode/text()'),xpath(xmldata,'Company/Employee/Address/Country/text()'),xpath(xmldata,'Company/Employee/Passport/text()'),
xpath(xmlUDF(xmldata),'Company/Employee/Visa/text()'),
xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
xpath(xmlUDF(xmldata),'Company/Employee/Contact/Phone/text()')
FROM companyxml;
OK
Time taken: 0.727 seconds

Here, I have used UDF for Visa and Phone as these two had an empty tag. Now check the stored data into view.

Step 5: Validate output

hive> select * from companyview;
OK
["458790","458791"]     ["Sameer","Gohar"]      ["sameer@email.com","Gohar@email.com"]  ["105","485"]   ["Grand Road","Camac Street Road"]        ["Bangalore","Mumbai"]  ["Karnataka","Maharastra"]      ["560068","400001"]     ["India","India"]["Available","Available"]        ["NULL","NULL"] ["9909999999","9908888888"]     ["8044552266","NULL"]
Time taken: 1.184 seconds, Fetched: 1 row(s)

Wrapping Up

Here, you have seen how to handle if the XML data file contains one or many empty tags. This approach is suitable to assign any default values to empty tags.

Keep Learning 🙂

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply