Load XML file in Pig

Requirement

Assume you have an XML file that has been transferred to your local system by some other application. The file contains customer data, and you need to process this data using Pig. The challenge is that the file is not a simple text or CSV file; it is an XML file.

Solution

Follow the steps below:

Step 1: Sample file

Don’t have customer data? No worries, download the file from here: customers_data, or create a sample file named customers_data.xml. If the file is on Windows, transfer it to your Linux machine via WinSCP. I have a local directory, /root/bigdataprogrammers/input_files, and I have kept the customers_data.xml file in that directory.

Please refer to the screenshot below.

You can see the content of that file using the following command in the shell:

 cat /root/bigdataprogrammers/input_files/customers_data.xml

Please observe the content carefully; understanding the content of the file is the most important part of this whole article.
You can see that the “customer” tag is repeated, and each occurrence holds the information of a different customer in three subtags: “name”, “id”, and “city”.

Now, our main aim is to create a single record for each customer, where each record has three columns: name, id, and city. Once we load this into a Pig relation, we can process that relation as per our needs.
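For reference, a minimal customers_data.xml matching this structure might look like the following (the names, ids, and cities here are illustrative placeholders, not the actual downloadable data):

```xml
<customers>
    <customer>
        <name>Alice</name>
        <id>101</id>
        <city>Pune</city>
    </customer>
    <customer>
        <name>Bob</name>
        <id>102</id>
        <city>Mumbai</city>
    </customer>
</customers>
```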

Step 2: Move to HDFS

As this XML file is present on the local system, let’s put it into an HDFS location. I already have a location, i.e. ‘hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/’
Use the command below to put the XML file into that HDFS location:

 hadoop fs -put /root/bigdataprogrammers/input_files/customers_data.xml hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/

Then enter the following command in the terminal:

 pig

It will take you to the grunt shell.

Step 3: Macro definition

There must be something that reads the values inside the tags, right? For example, given <id>34</id>, it should return 34. For that, we use DEFINE to create readVal as an alias for the Piggybank XPath UDF, as below.

 DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();

So, now we have readVal; as the name implies, we can use it to read values.
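Note: depending on your Pig installation, the Piggybank classes may not be on the classpath by default. If the DEFINE fails with a class-not-found error, register the piggybank jar first. The path below is an assumption for a typical HDP sandbox; adjust it to wherever the jar lives in your environment:

```pig
-- Path is an assumption; locate piggybank.jar in your own installation
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();
```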

Step 4: Loading file

Now, load the XML file into the relation CUSTOMERS_DATA using XMLLoader. Use the command below, mentioning the HDFS input location of the XML file:

 CUSTOMERS_DATA = load 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml' using org.apache.pig.piggybank.storage.XMLLoader('customer') as (customer);

Here, we have passed customer as the argument because we want each <customer> element to be treated as a single record. The loader emits one record per occurrence of the tag, so n customers in the file yield n records in the output.

Step 5: Extracting Columns

Now that we have each customer’s data in one record, it’s time to extract the respective values using readVal.
Use the command below to get the values:

 CNV_CUSTOMER_DATA = foreach CUSTOMERS_DATA generate readVal(customer,'name') AS name, readVal(customer,'id') AS id, readVal(customer,'city') AS city;

As you can see, we want the name of a customer, so XPath evaluates the expression 'name' against the <customer> element of the current record; hence we pass customer and 'name' as the arguments.

Similarly, we pass id and city.
The relation CNV_CUSTOMER_DATA now holds all the values. We can use this relation for further processing as per the requirements.
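Putting the steps together, the whole flow can be saved as one Pig script (the paths are the ones used in this walkthrough; the filename load_xml.pig is just a suggestion) and run with `pig -f load_xml.pig`:

```pig
-- Alias for the Piggybank XPath UDF
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();

-- One record per <customer> element in the file
CUSTOMERS_DATA = load 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('customer') as (customer);

-- Extract the three subtags as columns
CNV_CUSTOMER_DATA = foreach CUSTOMERS_DATA generate
    readVal(customer,'name') AS name,
    readVal(customer,'id')   AS id,
    readVal(customer,'city') AS city;

dump CNV_CUSTOMER_DATA;
```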

Step 6: Output

Use the command below to see the contents of CNV_CUSTOMER_DATA:

 dump CNV_CUSTOMER_DATA

Please refer to the screenshot below.

As you can see, each customer’s data is now a single record with three columns.
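In general, dump prints one parenthesized, comma-separated tuple per record. For a file with two customers, say Alice (id 101, Pune) and Bob (id 102, Mumbai), the output would look along these lines (values here are illustrative, not from the actual download):

```
(Alice,101,Pune)
(Bob,102,Mumbai)
```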

Wrapping Up

You must have heard that Pig can be used to process semi-structured data, right? We have just done exactly that, because XML is considered semi-structured data.

Keep learning.
