Skip to content
Unable to parse XML…
 
Notifications
Clear all

Unable to parse XML data in Hive

1 Posts
2 Users
1 Reactions
4,711 Views
0
Topic starter

Hi All,

I am trying to parse some XML data with the xpath functions in Hive and receiving below error.

What i have done so far:

Sample Data:

<Store>
    <Version>1.1</Version>
    <StoreId>16695</StoreId>    
    <Bskt>
      <TillNo>4</TillNo>
      <BsktNo>1753</BsktNo>
      <DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
      <OpID>50056</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>29559</GTIN>
        <ItmDsc>CHOCALATE</ItmDsc>
      <ItmProm>
          <PromCD>CM</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>59653</GTIN>
        <ItmDsc>CORN FLAKES</ItmDsc>
      </Itm>
        <Itm>
        <ItmSeq>3</ItmSeq>
        <GTIN>42260</GTIN>
        <ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
        <ItmProm>
          <PromCD>MTSRO</PromCD>
          <OfferID>11766</OfferID>
        </ItmProm>
      </Itm>
    </Bskt>
    <Bskt>
      <TillNo>5</TillNo>
      <BsktNo>1947</BsktNo>
      <DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
      <OpID>50063</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>24064</GTIN>
        <ItmDsc>TOMATOES 2KG</ItmDsc>
        <ItmProm>
          <PromCD>INSTORE</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>81287</GTIN>
        <ItmDsc>ROTHMANS BLUE</ItmDsc>
        <ItmProm>
          <PromCD>TF</PromCD>
        </ItmProm>
      </Itm>
    </Bskt>
  </Store>  

 

1) Create an external table to read the entire XML data in a single column.

CREATE EXTERNAL TABLE poc_scanp_xml_single(xmldata STRING) LOCATION '/DEV/TEST/nanda_test';

2) Create a view on top 

CREATE VIEW xmldataview (version,storeid,basket_dtm,basket_num,till_number,item_sequence,gtin_number,item_desc,promo_code,offer_id)
AS SELECT
xpath_string(xmldata,'Store/Version/text()'),
xpath_string(xmldata,'Store/StoreId/text()'),
xpath_string(xmldata,'Store/Bskt/DateTime/text()'),
xpath_string(xmldata,'Store/Bskt/BsktNo/text()'),
xpath_string(xmldata,'Store/Bskt/TillNo/text()'),
xpath_string(xmldata,'Store/Bskt/Itm/ItmSeq/text()'),
xpath_string(xmldata,'Store/Bskt/Itm/GTIN/text()'),
xpath_string(xmldata,'Store/Bskt/Itm/ItmDsc/text()'),
xpath_string(xmldata,'Store/Bskt/Itm/ItmProm/PromCD/text()'),
xpath_string(xmldata,'Store/Bskt/Itm/ItmProm/OfferID/text()')
FROM POC_SRC_IGA_SCAN_BASKET_ITEM_PROMO_SINGLE;

Error while reading the data from view:

Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret=’\x8c\x8c\xd4\xa6\xa8\x84F\x14\x80\xbb\x01\xc8$\x85\xaa\x9b’, guid=’~}\\\xef%\x0eI\x99\xbc\xff\x85p /\xbe5′)), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage=’java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.xml.UDFXPathString.evaluate(java.lang.String,java.lang.String) on object org.apache.hadoop.hive.ql.udf.xml.UDFXPathString@6dfa6b18 of class org.apache.hadoop.hive.ql.udf.xml.UDFXPathString with arguments { <Store>:java.lang.String, Store/Version/text():java.lang.String} of size 2′, sqlState=None, infoMessages=[‘*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.xml.UDFXPathString.evaluate(java.lang.String,java.lang.String) on object org.apache.hadoop.hive.ql.udf.xml.UDFXPathString@6dfa6b18 of class org.apache.hadoop.hive.ql.udf.xml.UDFXPathString with arguments { <Store>:java.lang.String, Store/Version/text():java.lang.String} of size 2:25:24’, ‘org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:329’, ‘org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:262’, ‘org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:732’, ‘sun.reflect.GeneratedMethodAccessor12:invoke::-1’, ‘sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43’, ‘java.lang.reflect.Method:invoke:Method.java:606’, ‘org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78’, ‘org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36’, ‘org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63’, ‘java.security.AccessController:doPrivileged:AccessController.java:-2’, ‘javax.security.auth.Subject:doAs:Subject.java:415’, ‘org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1671’, ‘org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59’, ‘com.sun.proxy.$Proxy25:fetchResults::-1’, ‘org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:438’, ‘org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:695’, ‘org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1553’, ‘org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1538’, ‘org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39’, ‘org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39’, ‘org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56’, ‘org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:285’, ‘java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145’, ‘java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615’, ‘java.lang.Thread:run:Thread.java:745’, ‘*java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.xml.UDFXPathString.evaluate(java.lang.String,java.lang.String) on object org.apache.hadoop.hive.ql.udf.xml.UDFXPathString@6dfa6b18 of class org.apache.hadoop.hive.ql.udf.xml.UDFXPathString with arguments { <Store>:java.lang.String, Store/Version/text():java.lang.String} of size 2:27:2’, ‘org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:152’, ‘org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:1653’, ‘org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:324’, ‘*org.apache.hadoop.hive.ql.metadata.HiveException:Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.xml.UDFXPathString.evaluate(java.lang.String,java.lang.String) on object org.apache.hadoop.hive.ql.udf.xml.UDFXPathString@6dfa6b18 of class org.apache.hadoop.hive.ql.udf.xml.UDFXPathString with arguments { <Store>:java.lang.String, Store/Version/text():java.lang.String} of size 2:37:10’, ‘org.apache.hadoop.hive.ql.exec.FunctionRegistry:invoke:FunctionRegistry.java:967’, ‘org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge:evaluate:GenericUDFBridge.java:182’, ‘org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator:_evaluate:ExprNodeGenericFuncEvaluator.java:185’, ‘org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator:evaluate:ExprNodeEvaluator.java:77’, ‘org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator:evaluate:ExprNodeEvaluator.java:65’, ‘org.apache.hadoop.hive.ql.exec.SelectOperator:processOp:SelectOperator.java:77’, ‘org.apache.hadoop.hive.ql.exec.Operator:forward:Operator.java:815’, ‘org.apache.hadoop.hive.ql.exec.TableScanOperator:processOp:TableScanOperator.java:95’, ‘org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:424’, ‘org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:416’, ‘org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:138’, ‘*java.lang.reflect.InvocationTargetException:null:40:3’, ‘sun.reflect.GeneratedMethodAccessor311:invoke::-1’, ‘sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43’, ‘java.lang.reflect.Method:invoke:Method.java:606’, ‘org.apache.hadoop.hive.ql.exec.FunctionRegistry:invoke:FunctionRegistry.java:943’, “*java.lang.RuntimeException:Invalid expression ‘Store/Version/text()’:43:2”, ‘org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil:eval:UDFXPathUtil.java:74’, ‘org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil:evalString:UDFXPathUtil.java:83’, ‘org.apache.hadoop.hive.ql.udf.xml.UDFXPathString:evaluate:UDFXPathString.java:43’, ‘*javax.xml.xpath.XPathExpressionException:org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 10; XML document structures must start and end within the same entity.:44:1’, ‘com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl:evaluate:XPathExpressionImpl.java:305’, ‘org.apache.hadoop.hive.ql.udf.xml.UDFXPathUtil:eval:UDFXPathUtil.java:72’, ‘*org.xml.sax.SAXParseException:XML document structures must start and end within the same entity.:46:2’, ‘org.apache.xerces.parsers.DOMParser:parse::-1’, ‘org.apache.xerces.jaxp.DocumentBuilderImpl:parse::-1’, ‘com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl:evaluate:XPathExpressionImpl.java:302’], statusCode=3), results=None, hasMoreRows=None)

 

 




1 Answer
1

CREATE EXTERNAL TABLE poc_scanp_xml_single(xmldata STRING) LOCATION ‘/DEV/TEST/nanda_test’;

like above ,if you load a xml file without giving any properties than you would have one line of xml into xmldata string,but actually you should have everything which comes under “Store” tag ,into one line .i.e in xmldata. 

check attached image and change the values accordingly . You may need to add a jar which will have properties to deal with XML. 

 
Share: