Pass variables from shell script to pig script

Requirement

You have a Pig script that expects some variables to be passed in from a shell script. Say the Pig script is named daily_audit.pig and it expects the following three variables:

  • ip_loc
  • no_of_emp
  • op_loc

Solution

Step 1:

Let’s look at the content of daily_audit.pig:

 
 
  Daily_Audit = LOAD '${ip_loc}' USING PigStorage(',') AS (Company:chararray, empl:int);
  Audit = FILTER Daily_Audit BY empl > ${no_of_emp};
  STORE Audit INTO '${op_loc}' USING PigStorage(',');

 

From this we can see that three variables need to be declared in the shell script. I recommend you focus on the variables rather than the logic of this Pig script.
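To make the script concrete, here is a hypothetical sample of the comma-separated input it loads (Company, empl), with the Pig filter condition mimicked in awk. The file name and the values are made up for illustration only:

```shell
# Hypothetical sample input (Company:chararray, empl:int); values are made up.
cat > sample_input.csv <<EOF
Acme,5000
Initech,1200
Globex,3500
EOF

# The Pig FILTER keeps rows where empl > no_of_emp; with no_of_emp=3000,
# the same condition in awk keeps Acme and Globex.
awk -F',' '$2 > 3000' sample_input.csv
```

Running this prints the two rows the Pig script would store for no_of_emp=3000.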

Step 2: Assign the variables

Let’s declare these three variables in the shell script:

 
 
  Input_location=bigdataprogrammers/ip/
  employees=3000
  output_location=bigdataprogrammers/op/

You can change these variables whenever you need to, which is the point of assigning them in a shell script. In real projects these values often come from the output of another process.
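As a sketch of that idea, the values could be derived from a business date instead of being typed by hand. The date-based sub-directories below are an assumption for illustration, not part of the original layout:

```shell
#!/bin/sh
# Sketch: derive the three variables from a business date.
# The date-based sub-directories are hypothetical.
business_date=$(date +%Y-%m-%d)

Input_location="bigdataprogrammers/ip/${business_date}/"
employees=3000
output_location="bigdataprogrammers/op/${business_date}/"

echo "$Input_location"
echo "$output_location"
```

Each daily run then reads from and writes to its own dated directories without any edits to the Pig script.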

Step 3: Call Pig Script

Once the assignments are done, we can pass the variables while calling the Pig script.
Here is the command:

 
 
  pig -f "hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/daily_audit.pig" -param ip_loc=$Input_location -param no_of_emp=$employees -param op_loc=$output_location

In the command above, the $ sign takes the value of a shell variable and assigns it to the corresponding variable defined in Pig. For example, ip_loc is used in the Pig script while Input_location is defined in the shell script, so you write ip_loc=$Input_location, where the first name is the Pig variable and the second is the shell variable. You must use -param for each variable when calling the Pig script.
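Putting Steps 2 and 3 together, a minimal wrapper script might look like the sketch below. Quoting each $variable is an extra precaution (not in the original command) so values containing spaces survive word splitting, and the leading echo is only there so the sketch can run without a Pig installation:

```shell
#!/bin/sh
# Minimal wrapper sketch; drop the leading echo to actually invoke pig.
Input_location=bigdataprogrammers/ip/
employees=3000
output_location=bigdataprogrammers/op/

echo pig -f "hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/daily_audit.pig" \
    -param ip_loc="$Input_location" \
    -param no_of_emp="$employees" \
    -param op_loc="$output_location"
```

Saving this as, say, run_daily_audit.sh (a hypothetical name) keeps the whole assignment-and-call sequence in one place.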

Another Way

Instead of passing the variables one by one, we can use a parameter file that contains all of them.
Let’s create a file called parameters.txt:

 
 
  ip_loc=bigdataprogrammers/ip/
  no_of_emp=3000
  op_loc=bigdataprogrammers/op_new/

Define all the variables in it.
Then, while calling the Pig script, pass this file along.
Below is the command:

 
 
  pig -f "hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/daily_audit.pig" -param_file 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/parameters.txt'

Use -param_file to pass the parameter file.
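The parameter file itself can also be generated from shell variables just before the pig call. This sketch writes it locally; the commented hdfs dfs -put step (which assumes an HDFS client is available) shows where it would be uploaded:

```shell
#!/bin/sh
# Sketch: build parameters.txt from shell variables, one key=value per line.
ip_loc=bigdataprogrammers/ip/
no_of_emp=3000
op_loc=bigdataprogrammers/op_new/

cat > parameters.txt <<EOF
ip_loc=${ip_loc}
no_of_emp=${no_of_emp}
op_loc=${op_loc}
EOF

# Upload and use it (assumes an HDFS client; uncomment to run for real):
# hdfs dfs -put -f parameters.txt /user/root/bigdataprogrammers/parameters.txt
# pig -f "hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/daily_audit.pig" \
#     -param_file 'hdfs://sandbox.hortonworks.com:8020/user/root/bigdataprogrammers/parameters.txt'

cat parameters.txt
```

This way a changing value (say, a date-based op_loc) only has to be updated in one generated file.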

Wrapping Up

In real projects, the input and output locations of the data depend on the business date being processed. Since data is processed daily, the variables need to change every day, and we can’t hard-code them in the Pig script. In such cases we can assign the parameters in a shell script.

Keep learning.
