Load XML file in Pig


Assume you have an XML file which has been transferred to your local system by some other application. The file contains customer data, and you need to process this data using Pig. But the challenge here is that the file is not a simple text or CSV file; it is an XML file. Let's try this.


Please follow the steps below:

Step 1: Sample file

Don’t you have customer data? No worries, download the file from here: customers_data, or create a sample file named customers_data.xml. If you have the file on Windows, transfer it to your Linux machine via WinSCP. I have a local directory /root/bigdataprogrammers/input_files, so I have kept the customers_data.xml file in that directory.

Please refer to the screenshot below.

You can see the content of the file using the below command in the shell.

  1. cat /root/bigdataprogrammers/input_files/customers_data.xml

Please observe the content carefully, because understanding the content of the file is the most important part of this whole article.
You can see that we have a “customer” tag which is repeated, and each occurrence holds the information of a different customer. It further has three subtags: “name”, “id”, and “city”.

Now our main aim is to create a single record for each customer, and this record will have three columns: name, id, and city. Once we achieve this in one Pig relation, we are done; we can further process that relation as per our needs.
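For reference, a minimal customers_data.xml matching the structure described above could look like the snippet below. Note that the names, ids, and cities here are made-up sample values, not the actual downloadable file.

```xml
<!-- Hypothetical sample data: a repeated <customer> tag with three subtags -->
<customers>
    <customer>
        <name>John</name>
        <id>1</id>
        <city>Dallas</city>
    </customer>
    <customer>
        <name>Maria</name>
        <id>2</id>
        <city>Austin</city>
    </customer>
</customers>
```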

Step 2: Move to HDFS

As this XML file is present on the local system, let's put it into an HDFS location. I already have a location, i.e. 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/'.
Please use the below command to put the XML file into the HDFS location.

  1. hadoop fs -put /root/bigdataprogrammers/input_files/customers_data.xml hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/

Enter the below-mentioned command in PuTTY:

  1. pig

It will take you to the grunt shell.

Step 3: Macro definition

There must be something which reads the values present inside the tags, right? For example, when I say <id>34</id>, it should give me 34. So for that, we need to create a macro as below.

  1. DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();

So now we have readVal; as the name implies, we can use it to read values.
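Note that XPath (and the XMLLoader used in the next step) lives in the piggybank contrib jar, which may not be on Pig's classpath in every installation. If the DEFINE fails to resolve the class, register the jar first; the path below is an assumption based on an HDP-style sandbox layout, so adjust it for your environment.

```pig
-- Register the piggybank jar so XPath and XMLLoader resolve.
-- The path is environment-specific; this one assumes an HDP-style install.
REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar;
```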

Step 4: Loading file

Now load the XML file into the relation CUSTOMERS_DATA using XMLLoader. Please use the below command and mention the HDFS input location of the XML file.

  1. CUSTOMERS_DATA = load 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml' using org.apache.pig.piggybank.storage.XMLLoader('customer') as (customer);

Here we have passed customer as an argument because we want to treat each <customer> element as a single record. As this tag occurs n times in the file, we will have n records in the output.
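If you want to sanity-check this step before moving on, you can dump the raw relation; each tuple should contain the full text of one <customer> element as a single chararray field.

```pig
-- Inspect the raw records produced by XMLLoader:
-- one tuple per <customer> element, holding its full XML text
dump CUSTOMERS_DATA;
```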

Step 5: Extracting Columns

Now that we have each customer's data in one record, it's time to extract the respective values using readVal.
Please use the below command to get the values:

  1. CNV_CUSTOMER_DATA = foreach CUSTOMERS_DATA generate readVal (customer,'name') AS name,readVal (customer,'id') AS id,readVal (customer,'city') AS city;

As you can see, we want the name of a customer, so XPath should first read customer and then name; hence we have passed customer and 'name' as the arguments.

Similarly, we can pass 'id' and 'city'.
Well, we are done; our relation CNV_CUSTOMER_DATA now has all the values. We can use this relation for further processing as per requirements.
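To recap, the whole flow fits in one short Pig script. The sketch below also adds an optional STORE at the end to write the relation out as comma-separated text; the output path is just an example, so point it wherever you need.

```pig
-- Alias the piggybank XPath UDF for reading values out of tags
DEFINE readVal org.apache.pig.piggybank.evaluation.xml.XPath();

-- Load the file, producing one record per <customer> element
CUSTOMERS_DATA = load 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/ip/customers_data.xml'
    using org.apache.pig.piggybank.storage.XMLLoader('customer') as (customer);

-- Extract the three subtags into separate columns
CNV_CUSTOMER_DATA = foreach CUSTOMERS_DATA generate
    readVal(customer, 'name') AS name,
    readVal(customer, 'id')   AS id,
    readVal(customer, 'city') AS city;

-- Optional: persist the result as CSV (example output path)
STORE CNV_CUSTOMER_DATA INTO '/user/root/bigdataprogrammers/op/customers_csv' USING PigStorage(',');
```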

Step 6: Output

Use the below command to see the dataset CNV_CUSTOMER_DATA:

  1. dump CNV_CUSTOMER_DATA;

Please refer to the screenshot below.

As you can see, each customer's data is now one single record with three columns.

Wrapping Up

You must have heard that Pig can be used to process semi-structured data, right? XML is considered semi-structured data, so that is exactly what we have just done.

Keep learning.
