Load xml file in pig

Requirement

Assume you have an XML file that was transferred to your local system by some other application. The file contains customer data, and you need to process it using Pig. The challenge is that the file is not a simple text or CSV file; it is an XML file. Let’s try this.


Solution

Please follow the steps below:

Step 1: Sample file

Don’t have customer data? No worries: download the file from here (customers_data), or create a sample file named customers_data.xml. If the file is on Windows, transfer it to your Linux machine via WinSCP. I have a local directory /root/bigdataprogrammers/input_files, so I have kept customers_data.xml in that directory.

Please refer to the screenshot below.

You can see the content of the file using the below command in the shell.

 
 
cat /root/bigdataprogrammers/input_files/customers_data.xml

Please observe the content carefully, because understanding the content of the file is the most important part of this whole article.
You can see that we have a “customer” tag which repeats, and each occurrence holds a different customer’s information. It in turn has three subtags: “name”, “id”, and “city”.
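For reference, a minimal customers_data.xml with this structure might look like the following (the customer values here are made-up sample data, not the actual downloaded file):

```xml
<customers>
    <customer>
        <name>John</name>
        <id>1</id>
        <city>Chicago</city>
    </customer>
    <customer>
        <name>Mary</name>
        <id>2</id>
        <city>Dallas</city>
    </customer>
</customers>
```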

Now our main aim is to create a single record for each customer, and this record will have three columns: name, id, and city. Once we get this into one Pig relation, we are done; we can then process that relation further as per our needs.

Step 2: Move to HDFS

As this XML file is present on the local system, let’s put it into an HDFS location. I already have a location, i.e. ‘hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/’.
Please use the below command to put the XML file into that HDFS location.

 
 
hadoop fs -put /root/bigdataprogrammers/input_files/customers_data.xml hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/

Then enter the below command in PuTTY:

 
 
pig

It will take you to the grunt shell.

Step 3: Macro definition

There must be something that reads the values present inside the tags, right? For example, when I say <id>34</id>, it should give me 34. For that, we define an alias for the XPath UDF as below.

 
 
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();

So now we have readVal; as the name implies, we can use it to read values.
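Note: on some setups Pig cannot resolve the piggybank UDFs until the jar is registered. If the DEFINE above fails with a class-not-found error, registering piggybank first usually helps (the jar path below is an assumption; locate piggybank.jar in your own installation and adjust accordingly):

```pig
-- path is environment-specific; find piggybank.jar on your machine first
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();
```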

Step 4: Loading the file

Now load the XML file into the relation CUSTOMERS_DATA using XMLLoader. Please use the below command and mention the HDFS input location of the XML file.

 
 
CUSTOMERS_DATA = load 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml' using org.apache.pig.piggybank.storage.XMLLoader('customer') as (customer);

Here we have passed customer as an argument because we want to treat each <customer> element as a single record. As this element repeats n times in the file, we will have n records in the output.
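If you want to double-check what got loaded before going further, you can inspect the relation from the grunt shell, for example:

```pig
-- optional sanity check in the grunt shell
DESCRIBE CUSTOMERS_DATA;   -- prints the schema of the relation
```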

Step 5: Extracting columns

Now that we have each customer’s data in one record, it’s time to extract the respective values using readVal.
Please use the below command to get the values.

 
 
CNV_CUSTOMER_DATA = foreach CUSTOMERS_DATA generate readVal(customer,'name') AS name, readVal(customer,'id') AS id, readVal(customer,'city') AS city;

As you can see, we want the name of a customer, so XPath should first read customer and then name; hence we have passed customer and 'name' as the arguments.

Similarly, we pass 'id' and 'city'.
Well, we are done. Our relation CNV_CUSTOMER_DATA now has all the values, and we can process it further as per our requirements.
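Putting it together, the whole flow above can be collected into a single Pig script (same paths and relation names as in this article), which you could also run outside the grunt shell:

```pig
-- xml_customers.pig: extract name, id, city from each <customer> element
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();

CUSTOMERS_DATA = LOAD 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('customer') AS (customer:chararray);

CNV_CUSTOMER_DATA = FOREACH CUSTOMERS_DATA GENERATE
    readVal(customer, 'name') AS name,
    readVal(customer, 'id')   AS id,
    readVal(customer, 'city') AS city;

DUMP CNV_CUSTOMER_DATA;
```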

Step 6: Output

Use the below command to see the dataset CNV_CUSTOMER_DATA.

 
 
dump CNV_CUSTOMER_DATA;

Please refer to the screenshot below.

As you can see, each customer’s data is now a single record with three columns.

Wrapping Up

You must have heard that Pig can be used to process semi-structured data, right? Well, we have just done exactly that, since XML is considered semi-structured data.

Keep learning.

