SUM in pig

Problem 1

Write a pig script to calculate the sum of profits earned by selling a particular product.
Below is the code to create data into hive table.

 
 
  1. CREATE SCHEMA IF NOT EXISTS bdp;
  2. CREATE TABLE bdp.profits (product_id INT,profit BIGINT);
  3. INSERT INTO TABLE bdp.profits VALUES
  4. ('123','1365'),('124','3253'),('125','91522'),
  5. ('123','51842'),('127','19616'),('128','2433'),
  6. ('127','182652'),('130','21632'),('122','21632'),
  7. ('127','21632'),('135','21632'),('123','21632'),('135','3282');

Solution
Step 1: Load Data

Let’s first login to hive and load data into table “profits” which is under bdp schema.
After executing query, verify that data is loaded successfully.

use below command

 
 
  1. SELECT * FROM bdp.profits;

 


After analyzing data we can say that each product has associated profits with it, and also some products have multiple profits. Our Goal is to calculate total profit on particular product id.

Step 2: Import table into pig

As we need to process this dataset using Pig so let’s go to grunt shell, use below command to enter into grunt shell, remember –useHCatalog is must as we need jars which are required to fetch data from a hive.

 
 
  1. pig -useHCatalog;

Let’s have one relation PROFITS in which we can load data from hive table.

 
 
  1. PROFITS = LOAD 'bdp.profits' USING org.apache.hive.hcatalog.pig.HCatLoader();

dump PROFITS will give below result.

Step 3: Grouping based on product_id

As we have column product_id, so grouping on a product will give us all tuples which have same product id into one bag.
Execute below command to get grouped records.

 
 
  1. GRPD_PROFITS = GROUP PROFITS BY product_id;

dump GRPD_PROFITS will give below result.

You can observe here that product id 123 have 3 tuples.
You can use describe to see the schema of relation GRPD_PROFITS
Below is the result.

Step 4: SUM operator

Now it’s time to apply Sum operator on every tuple of each product id. All profits will be added .use below command

 
 
  1. SUM_PROFIT = FOREACH GRPD_PROFITS GENERATE group,SUM(PROFITS.profit) as total_profit;

In the output, we want only group i.e product_id and sum of profits i.e total_profit .so we are generating only group key and total profit.
Below is the results.

Observe that total selling profit of product which has id 123 is 74839.


Keep solving, keep learning.Subscribe us.

Load CSV file into hive AVRO table

Requirement You have comma separated(CSV) file and you want to create Avro table in hive on top of it, then ...
Read More

Load CSV file into hive PARQUET table

Requirement You have comma separated(CSV) file and you want to create Parquet table in hive on top of it, then ...
Read More

Hive Most Asked Interview Questions With Answers – Part II

What is bucketing and what is the use of it? Answer: Bucket is an optimisation technique which is used to ...
Read More
/ hive, hive interview, interview-qa

Spark Interview Questions Part-1

Suppose you have a spark dataframe which contains millions of records. You need to perform multiple actions on it. How ...
Read More

Leave a Reply