Windowing Functions in Hive

Windowing Functions in Hive

Requirement

In this post, we are going to explore windowing functions in Hive. These are the windowing functions:

  • LEAD
  • LAD
  • FIRST_VALUE
  • LAST_VALUE
  • MIN/MAX/COUNT/AVG OVER Clause

Component Involved

  • Hive

Sample Data

ID

FIRST_NAME

LAST_NAME

DESIGNATION

DEPARTMENT

SALARY

1001

Jervis

Roll

Director of Sales

Sales

30000

1002

Gordon

Mattster

Marketing Manager

Sales

25000

1003

Gracie

Fronllen

Assistant Manager

Sales

25000

1004

Joelly

Wellback

Account Coordinator

Account

15000

1005

Bob

Havock

Accountant II

Account

20000

1006

Carmiae

Courage

Account Coordinator

Account

15000

1007

Cellie

Trevaskiss

Assistant Manager

Sales

25000

1008

Gally

Johnson

Manager

Account

28000

1009

Richard

Grill

Account Coordinator

Account

12000

1010

Sofia

Ketty

Sales Coordinator

Sales

20000

Solution

Let’s first discuss what is windowing?

Windowing allows features to create a window on the set of data in order to operate aggregation like COUNT, AVG, MIN, MAX and other analytical functions such as LEAD, LAG, FIRST_VALUE, and LAST_VALUE.

The syntax of the query with windows:

SELECT <columns_name>, <aggregate>(column_name) OVER (<windowing specification>) FROM <table_name>;

where,

column_name – column name of the table

Aggregate – Any aggregate function(s) like COUNT, AVG, MIN, MAX

Windowing specification – It includes following:

  • PARTITION BY – Takes a column(s) of the table as a reference.
  • ORDER BY – Specified the Order of column(s) either Ascending or Descending.
  • Frame – Specified the boundary of the frame by stat and end value. The boundary either be a type of RANGE or ROW followed by PRECEDING, FOLLOWING and any value.

These three (PARTITION, ORDER BY, and Window frame) are either be alone or together.

PARTITION BY

Count Employees in each department

 
 
SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl;

This above output has employees count for each department. Here, getting duplicate records. In order to get the only distinct record, we can use the query like below:

 
 
SELECT DISTINCT * FROM (SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl) A;

ORDER BY

Case I: Without PARTITION

Count Employee with salary descending order

 
 
SELECT id, department, salary, COUNT(id) OVER (ORDER BY salary DESC) FROM emp_dept_tbl;

Case II: With PARTITION

Count Employees of each department order by salary

 
 
SELECT id, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC) FROM emp_dept_tbl;

WINDOWING Specification

In the windowing frame, you can define the subset of rows in which the windowing function will work. You can specify this subset using upper and lower boundary value using windowing specification.

The syntax to defined windowing specification with ROW/RANGE looks like:

ROW|RANGE BETWEEN <upper expression> AND <lower expression>

Here,

UPPER EXPRESSION can have these 3 value:

  • UNBOUNDED PRECEDING – It denotes window will start from the first row of the group/partition.
  • CURRENT ROW – Window will start from the current row.
  • <INTEGER VALUE> PRECEDING – Provide any specific row to start window

LOWER EXPRESSION

  • UNBOUNDED FOLLOWING – It means the window will end at the last row of the group/partition.
  • CURRENT ROW – Window will end at the current row
  • <INTEGER VALUE> FOLLOWING – Window will end at specific row

Now, let’s use these different upper and lower expression as a combination on different cases and check how it is working.

UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

Case I: Count Employees with windowing specification

 
 
SELECT id, first_name, designation, department, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

It assigned final count of record to each record i.e. 10.

Case II: Count Employees with PARTITION BY department and windowing specification

 
 
SELECT id, first_name, designation, department, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

Here, it has counted the total no. of rows based on the partitioned by the department and assigned the value to each row.

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 
 
SELECT id, first_name, designation, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

The output seems similar to the previous one except for the order of row based on the salary descending order.

UNBOUNDED PRECEDING and CURRENT ROW

Case I: Count Employees with windowing specification only

 
 
SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the row has been assigned its current position to the count value without any specific order.

Case II: Count Employees with PARTITION BY department and windowing specification

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the row has started and gone till the end of each department that’s why Account and Sales department’s rows having count value 1 to 5 respectively.

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the output seems similar to the previous one except for the order of each row.

CURRENT ROW AND UNBOUNDED FOLLOWING

Case I: Count Employees with windowing specification only

 
 
SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) FROM emp_dept_tbl;

Case II: Count Employees with PARTITION BY department and windowing specification

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

CURRENT ROW AND 3 FOLLOWING

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) FROM emp_dept_tbl;

3 PRECEDING AND 3 FOLLOWING

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;

3 PRECEDING AND CURRENT ROW

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

3 PRECEDING AND UNBOUNDED FOLLOWING

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

UNBOUNDED PRECEDING AND 3 FOLLOWING

 
 
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;

LEAD

It is an analytics function used to return the data from the next set of rows. By default, the lead is of 1 row and it will return NULL in case it exceeds the current window.

 
 
SELECT id, first_name, designation, department,salary, LEAD(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
ID FIRST_NAME DESIGNATION DEPARTMENT SALARY LEAD
1009 Richard Account Coordinator Account 12000 1004
1004 Joelly Account Coordinator Account 15000 1006
1006 Carmiae Account Coordinator Account 15000 1005
1005 Bob Accountant II Account 20000 1008
1008 Gally Manager Account 28000 NULL
1010 Sofia Sales Coordnitor Sales 20000 1002
1002 Gordon Marketing Manager Sales 25000 1003
1003 Gracie Assistant Manager Sales 25000 1007
1007 Cellie Assistant Manager Sales 25000 1001
1001 Jervis Director of Sales Sales 30000 NULL

LAG

It is the opposite of LEAD function, it returns the data from the previous set of data. By default lag is of 1 row and return NULL in case the lag for the current row is exceeded before the beginning of the window:

 
 
SELECT id, first_name, designation, department,salary, LAG(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
ID FIRST_NAME DESIGNATION DEPARTMENT SALARY LAG
1009 Richard Account Coordinator Account 12000 NULL
1004 Joelly Account Coordinator Account 15000 1009
1006 Carmiae Account Coordinator Account 15000 1004
1005 Bob Accountant II Account 20000 1006
1008 Gally Manager Account 28000 1005
1010 Sofia Sales Coordnitor Sales 20000 NULL
1002 Gordon Marketing Manager Sales 25000 1010
1003 Gracie Assistant Manager Sales 25000 1002
1007 Cellie Assistant Manager Sales 25000 1003
1001 Jervis Director of Sales Sales 30000 1007

FIRST_VALUE

This function returns the value from the first row in the window based on the clause and assigned to all the rows of the same group:

 
 
SELECT id, first_name, designation, department,salary, FIRST_VALUE(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
ID FIRST_NAME DESIGNATION DEPARTMENT SALARY FIRST_VALUE
1009 Richard Account Coordinator Account 12000 1009
1004 Joelly Account Coordinator Account 15000 1009
1006 Carmiae Account Coordinator Account 15000 1009
1005 Bob Accountant II Account 20000 1009
1008 Gally Manager Account 28000 1009
1010 Sofia Sales Coordnitor Sales 20000 1010
1002 Gordon Marketing Manager Sales 25000 1010
1003 Gracie Assistant Manager Sales 25000 1010
1007 Cellie Assistant Manager Sales 25000 1010
1001 Jervis Director of Sales Sales 30000 1010

LAST_VALUE

In reverse of FIRST_VALUE, it return the value from the last row in a window based on the clause and assigned to all the rows of the same group:

ID FIRST_NAME DESIGNATION DEPARTMENT SALARY LAST_VALUE
1009 Richard Account Coordinator Account 12000 1009
1004 Joelly Account Coordinator Account 15000 1006
1006 Carmiae Account Coordinator Account 15000 1006
1005 Bob Accountant II Account 20000 1005
1008 Gally Manager Account 28000 1008
1010 Sofia Sales Coordnitor Sales 20000 1010
1002 Gordon Marketing Manager Sales 25000 1007
1003 Gracie Assistant Manager Sales 25000 1007
1007 Cellie Assistant Manager Sales 25000 1007
1001 Jervis Director of Sales Sales 30000 1001

Here, the first row has been assigned value 1009 but for the next two rows have assigned 1006 as both ids 1004 and 1006 have same salary 15000.

Wrapping Up

In this post, we have gone through the windowing function in Hive. We have used all the windowing function including LEAD, LAG, FIRST_VALUE, and LAST_VALUE and have seen how we can apply these function in the query. In addition to this, we checked the windowing specification with different lower and upper boundary expression.

0
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.