Assume an XML file containing customer data has been transferred to your local system by another application, and you need to process this data using Pig. The challenge is that the file is not a simple text or CSV file; it is XML. Let's try this.
Please follow the steps below:
Step 1: Sample file
Don't have customer data? No worries, download the file from here: customers_data, or create a sample file named customers_data.xml. If the file is on Windows, transfer it to your Linux machine via WinSCP. I have a local directory /root/bigdataprogrammers/input_files, so I have kept the customers_data.xml file in that directory.
Please refer to the screenshot below.
You can see the content of the file using the cat command in the shell:
cat /root/bigdataprogrammers/input_files/customers_data.xml
Please observe the content carefully, because understanding the content of the file is the most important part of this whole article.
You can see that the "customer" tag repeats, and each occurrence holds a different customer's information. It in turn has three subtags: "name", "id", and "city".
Now our main aim is to create a single record for each customer, and this record will have three columns: name, id, and city. Once we get this into one Pig relation, we are done; we can process that relation further as per our needs.
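If you are creating the sample file yourself, a minimal customers_data.xml could look like the snippet below. The names, ids, and cities here are made-up illustration values, not the content of the downloadable file:

```xml
<customers>
  <customer>
    <name>John</name>
    <id>101</id>
    <city>London</city>
  </customer>
  <customer>
    <name>Priya</name>
    <id>102</id>
    <city>Mumbai</city>
  </customer>
</customers>
```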
Step 2: Move the file to HDFS
As this XML file is present on the local system, let's put it into an HDFS location. I already have one, i.e. 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/'
Please use the command below to put this XML file into that HDFS location.
hadoop fs -put /root/bigdataprogrammers/input_files/customers_data.xml hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/
Now enter the pig command in PuTTY.
It will take you to the grunt shell.
Step 3: Macro definition
There must be something that reads the values present inside the tags, right? For example, when I say <id>34</id>, it should give me 34. For that, we need to create a macro (an alias for the piggybank XPath UDF) as below.
- DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();
So now we have readVal; as the name implies, we can use it to read values.
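Note: the XPath class ships in the piggybank jar. On some setups it is available by default; if grunt complains that it cannot resolve the class, register the jar first. The path below is only an assumption for a Hortonworks sandbox, so adjust it to your installation:

```pig
-- Path to piggybank.jar varies by installation; this one is an assumption
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();
```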
Step 4: Loading the file
Now load the XML file into the relation CUSTOMERS_DATA using XMLLoader. Please use the command below, mentioning the HDFS input location of the XML file.
- CUSTOMERS_DATA = load 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml' using org.apache.pig.piggybank.storage.XMLLoader('customer') as (customer);
Here we have passed 'customer' as an argument because we want to treat each <customer> element as a single record. As this tag occurs n times, we will have n records in the output.
Step 5: Extracting columns
Now that we have each customer's data in one record, it's time to extract the respective values using readVal.
Please use the command below to get the values.
- CNV_CUSTOMER_DATA = foreach CUSTOMERS_DATA generate readVal(customer, 'name') AS name, readVal(customer, 'id') AS id, readVal(customer, 'city') AS city;
As you can see, we want the name of a customer, so XPath first reads the customer element and then its name subtag; hence we have written customer and 'name' as the arguments.
Similarly, we can pass 'id' and 'city'.
Well, we are done. Our relation CNV_CUSTOMER_DATA now has all the values, and we can use this relation for further processing as per requirements.
Step 6: Output
Use the command below to see the dataset CNV_CUSTOMER_DATA.
- dump CNV_CUSTOMER_DATA;
Please refer to the screenshot below.
As you can see, each customer's data is now one single record with three columns.
You must have heard that Pig can be used to process semi-structured data, right? Well, that is exactly what we have just done, because XML is considered semi-structured data.
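For reference, the whole flow above can be collected into a single Pig script and run non-interactively, e.g. with pig -f customers_xml.pig (the script name is just an example):

```pig
-- Alias for the piggybank XPath UDF that reads values out of tags
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();

-- Load the XML file; each <customer> element becomes one record
CUSTOMERS_DATA = load 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('customer') as (customer);

-- Extract one column per subtag from each customer record
CNV_CUSTOMER_DATA = foreach CUSTOMERS_DATA generate
    readVal(customer, 'name') AS name,
    readVal(customer, 'id')   AS id,
    readVal(customer, 'city') AS city;

dump CNV_CUSTOMER_DATA;
```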