Parse XML data in Hive

Parse XML data in Hive

Requirement

Suppose you are having an XML formatted data file. This source file contains some empty tag. The requirement is to parse XML data in Hive and read data with handling some tag which is empty in the source data.

Components Involved

  • Hive
  • Maven
  • Java

Solution

There are many solutions for parsing XML data into hive table. One solution out of these uses SerDe properties in ROW FORMAT by adding a hivexmlserde jar file. Another solution is to store XML files as a single string data into a temp hive table and then fetch the data of each tag using XPATH.

Download below Sample XML data and keep it at the local location in your system:

sample

Step 1: Create a temp hive table

create table
 
CREATE EXTERNAL TABLE companyxml(xmldata STRING) LOCATION '/user/hive/companyxml/company.xml';

In this step, we have created a temp table which stores  XML data as a single record. The location is the local path of the XML file.

Step 2:  Create View

load data
 
DROP VIEW companyview;
CREATE VIEW companyview
(id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
AS SELECT
xpath(xmldata,'Company/Employee/Id/text()'),
xpath(xmldata,'Company/Employee/Name/text()'),
xpath(xmldata,'Company/Employee/Email/text()'),
xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),
xpath(xmldata,'Company/Employee/Address/Street/text()'),
xpath(xmldata,'Company/Employee/Address/City/text()'),
xpath(xmldata,'Company/Employee/Address/State/text()'),
xpath(xmldata,'Company/Employee/Address/Pincode/text()'),
xpath(xmldata,'Company/Employee/Address/Country/text()'),
xpath(xmldata,'Company/Employee/Passport/text()'),
xpath(xmldata,'Company/Employee/Visa/text()'),
xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
xpath(xmldata,'Company/Employee/Contact/Phone/text()')
FROM companyxml;

Here, I am creating a view say companyview.  This view is used to parse each tag value from temp table.  In order to reach to the tag and get the tag value of the XML, we can use XPath.

Output:

Let’s check how the view looks like:

select data
 
SELECT * FROM companyview;
["458790","458791"]     ["Sameer","Gohar"]      ["sameer@email.com","Gohar@email.com"]  ["105","485"]      ["Grand Road","Camac Street Road"]      ["Bangalore","Mumbai"]  ["Karnataka","Maharastra"] ["560068","400001"]     ["India","India"]       ["Available","Available"] []       ["9909999999","9908888888"]     ["8044552266"]
Time taken: 0.41 seconds, Fetched: 1 row(s)

In the above output, the tag phone has only one value as only one employee has a phone number and visa value is an empty array (shown by highlighted). That means it has neglected all the empty tag values.

XPATH returns an Array list of data and skips all the empty tags as shown above. So to handle this problem need to write a custom hive UDF.

The logic behind the solution:

The problem is arising when your XML data file having an empty tag. So modify this XML data before passing to XPath and provide any value like blank, NULL for the empty tag. In this case, we will replace <phone/> by <phone>NULL</phone>.

Create a maven project in eclipse. And create a java file name XmlEmptyParse. Find the java code.

UDF Java Code:

UDF Java
 
import java.util.HashMap;
import java.util.Map; 
import org.apache.hadoop.hive.ql.exec.UDF; 
public class XmlEmptyParse extends UDF{ 
public String evaluate(String xmlData){ 
    String replaceValue = "NULL"; 
    Map<Integer, StringBuffer> xmlMap = new HashMap<Integer, StringBuffer>(); 
    xmlMap.put(1, new StringBuffer(xmlData)); 
    while (xmlMap.get(1).toString().contains("/>")) { 
        int index = xmlMap.get(1).toString().indexOf("/>"); 
        String sm = xmlMap.get(1).toString().substring(0, index); 
        int firstIndex = sm.lastIndexOf("<"); 
        String temp = xmlMap.get(1).toString().replace(xmlMap.get(1).toString().substring(firstIndex, index + 2), 
            "<" + xmlMap.get(1).toString().substring(firstIndex + 1, 
            index) + ">" + replaceValue + "</" + xmlMap.get(1).toString().substring(firstIndex + 1, index) + ">"); 
        xmlMap.put(1, new StringBuffer(temp)); 
        } 
        return xmlMap.get(1).toString(); 
    } 
}

Once done with the java code. Export the maven project as a jar file and keep it at a local location. I have exported this as a XmlParseUdf-0.0.1-SNAPSHOT.jar. Now, we are done with UDF code. Next, we are going to use this jar to utilize created UDF function.

Use UDF

Step 1: Add JAR

First, need to add the jar to the hive using below query.
ADD JAR [created_jar_location]
In my case, it is ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;

add jar
 
hive> ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;
Added [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar] to class path
Added resources: [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar]

Step 2: Create a Temporary function

temporary funcion
 
hive> CREATE TEMPORARY FUNCTION xmlUDF as 'hive.udf.XmlEmptyParse';
OK 
Time taken: 0.802 seconds

Next, in order to use our created UDF function, it is required to create a temporary function. This temporary function is used with the same name in the hive query.  So, here I have created temporary function say xmlUDF.

Step 3: Use created function

Finally, we have created UDF and also created a temporary function. In this step, we will use the temporary function which has been created in the previous step.

use function
 
hive> DROP VIEW companyview;
OK
Time taken: 2.793 seconds
hive> CREATE VIEW companyview(id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
AS SELECT
xpath(xmldata,'Company/Employee/Id/text()'),
xpath(xmldata,'Company/Employee/Name/text()'),
xpath(xmldata,'Company/Employee/Email/text()'),
xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),xpath(xmldata,'Company/Employee/Address/Street/text()'),xpath(xmldata,'Company/Employee/Address/City/text()'),
xpath(xmldata,'Company/Employee/Address/State/text()'),
xpath(xmldata,'Company/Employee/Address/Pincode/text()'),xpath(xmldata,'Company/Employee/Address/Country/text()'),xpath(xmldata,'Company/Employee/Passport/text()'),
xpath(xmlUDF(xmldata),'Company/Employee/Visa/text()'),
xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
xpath(xmlUDF(xmldata),'Company/Employee/Contact/Phone/text()')
FROM companyxml;
OK
Time taken: 0.727 seconds

Here, I have used UDF for Visa and Phone as these two had an empty tag. Now check the stored data into view.

Step 4: Validate output

 
 
hive> select * from companyview;
OK
["458790","458791"]     ["Sameer","Gohar"]      ["sameer@email.com","Gohar@email.com"]  ["105","485"]   ["Grand Road","Camac Street Road"]        ["Bangalore","Mumbai"]  ["Karnataka","Maharastra"]      ["560068","400001"]     ["India","India"]["Available","Available"]        ["NULL","NULL"] ["9909999999","9908888888"]     ["8044552266","NULL"]
Time taken: 1.184 seconds, Fetched: 1 row(s)

Wrapping Up

Here, you have seen how to handle if XML data file contains one or many empty tags. This approach is suitable to handle these issues.

Keep Learning 🙂

45
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.