Windowing Functions in Hive

Requirement

In this post, we are going to explore windowing functions in Hive. These are the windowing functions:

  • LEAD
  • LAD
  • FIRST_VALUE
  • LAST_VALUE
  • MIN/MAX/COUNT/AVG OVER Clause

Component Involved

  • Hive

Sample Data

ID

FIRST_NAME

LAST_NAME

DESIGNATION

DEPARTMENT

SALARY

1001

Jervis

Roll

Director of Sales

Sales

30000

1002

Gordon

Mattster

Marketing Manager

Sales

25000

1003

Gracie

Fronllen

Assistant Manager

Sales

25000

1004

Joelly

Wellback

Account Coordinator

Account

15000

1005

Bob

Havock

Accountant II

Account

20000

1006

Carmiae

Courage

Account Coordinator

Account

15000

1007

Cellie

Trevaskiss

Assistant Manager

Sales

25000

1008

Gally

Johnson

Manager

Account

28000

1009

Richard

Grill

Account Coordinator

Account

12000

1010

Sofia

Ketty

Sales Coordinator

Sales

20000

Solution

Let’s first discuss what is windowing?

Windowing allows features to create a window on the set of data in order to operate aggregation like COUNT, AVG, MIN, MAX and other analytical functions such as LEAD, LAG, FIRST_VALUE, and LAST_VALUE.

The syntax of the query with windows:

SELECT <columns_name>, <aggregate>(column_name) OVER (<windowing specification>) FROM <table_name>;

where,

column_name – column name of the table

Aggregate – Any aggregate function(s) like COUNT, AVG, MIN, MAX

Windowing specification – It includes following:

  • PARTITION BY – Takes a column(s) of the table as a reference.
  • ORDER BY – Specified the Order of column(s) either Ascending or Descending.
  • Frame – Specified the boundary of the frame by stat and end value. The boundary either be a type of RANGE or ROW followed by PRECEDING, FOLLOWING and any value.

These three (PARTITION, ORDER BY, and Window frame) are either be alone or together.

PARTITION BY

Count Employees in each department

 SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl;

This above output has employees count for each department. Here, getting duplicate records. In order to get the only distinct record, we can use the query like below:

 SELECT DISTINCT * FROM (SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl) A;

ORDER BY

Case I: Without PARTITION

Count Employee with salary descending order

 SELECT id, department, salary, COUNT(id) OVER (ORDER BY salary DESC) FROM emp_dept_tbl;

Case II: With PARTITION

Count Employees of each department order by salary

 SELECT id, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC) FROM emp_dept_tbl;

WINDOWING Specification

In the windowing frame, you can define the subset of rows in which the windowing function will work. You can specify this subset using upper and lower boundary value using windowing specification.

The syntax to defined windowing specification with ROW/RANGE looks like:

ROW|RANGE BETWEEN <upper expression> AND <lower expression>

Here,

UPPER EXPRESSION can have these 3 value:

  • UNBOUNDED PRECEDING – It denotes window will start from the first row of the group/partition.
  • CURRENT ROW – Window will start from the current row.
  • <INTEGER VALUE> PRECEDING – Provide any specific row to start window

LOWER EXPRESSION

  • UNBOUNDED FOLLOWING – It means the window will end at the last row of the group/partition.
  • CURRENT ROW – Window will end at the current row
  • <INTEGER VALUE> FOLLOWING – Window will end at specific row

Now, let’s use these different upper and lower expression as a combination on different cases and check how it is working.

UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

Case I: Count Employees with windowing specification

 SELECT id, first_name, designation, department, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

It assigned final count of record to each record i.e. 10.

Case II: Count Employees with PARTITION BY department and windowing specification

 SELECT id, first_name, designation, department, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

Here, it has counted the total no. of rows based on the partitioned by the department and assigned the value to each row.

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 SELECT id, first_name, designation, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

The output seems similar to the previous one except for the order of row based on the salary descending order.

UNBOUNDED PRECEDING and CURRENT ROW

Case I: Count Employees with windowing specification only

 SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the row has been assigned its current position to the count value without any specific order.

Case II: Count Employees with PARTITION BY department and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the row has started and gone till the end of each department that’s why Account and Sales department’s rows having count value 1 to 5 respectively.

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the output seems similar to the previous one except for the order of each row.

CURRENT ROW AND UNBOUNDED FOLLOWING

Case I: Count Employees with windowing specification only

 SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) FROM emp_dept_tbl;

Case II: Count Employees with PARTITION BY department and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

CURRENT ROW AND 3 FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) FROM emp_dept_tbl;

3 PRECEDING AND 3 FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;

3 PRECEDING AND CURRENT ROW

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

3 PRECEDING AND UNBOUNDED FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

UNBOUNDED PRECEDING AND 3 FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;

LEAD

It is an analytics function used to return the data from the next set of rows. By default, the lead is of 1 row and it will return NULL in case it exceeds the current window.

 SELECT id, first_name, designation, department,salary, LEAD(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
IDFIRST_NAMEDESIGNATIONDEPARTMENTSALARYLEAD
1009RichardAccount CoordinatorAccount120001004
1004JoellyAccount CoordinatorAccount150001006
1006CarmiaeAccount CoordinatorAccount150001005
1005BobAccountant IIAccount200001008
1008GallyManagerAccount28000NULL
1010SofiaSales CoordnitorSales200001002
1002GordonMarketing ManagerSales250001003
1003GracieAssistant ManagerSales250001007
1007CellieAssistant ManagerSales250001001
1001JervisDirector of SalesSales30000NULL

LAG

It is the opposite of LEAD function, it returns the data from the previous set of data. By default lag is of 1 row and return NULL in case the lag for the current row is exceeded before the beginning of the window:

 SELECT id, first_name, designation, department,salary, LAG(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
IDFIRST_NAMEDESIGNATIONDEPARTMENTSALARYLAG
1009RichardAccount CoordinatorAccount12000NULL
1004JoellyAccount CoordinatorAccount150001009
1006CarmiaeAccount CoordinatorAccount150001004
1005BobAccountant IIAccount200001006
1008GallyManagerAccount280001005
1010SofiaSales CoordnitorSales20000NULL
1002GordonMarketing ManagerSales250001010
1003GracieAssistant ManagerSales250001002
1007CellieAssistant ManagerSales250001003
1001JervisDirector of SalesSales300001007

FIRST_VALUE

This function returns the value from the first row in the window based on the clause and assigned to all the rows of the same group:

 SELECT id, first_name, designation, department,salary, FIRST_VALUE(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
IDFIRST_NAMEDESIGNATIONDEPARTMENTSALARYFIRST_VALUE
1009RichardAccount CoordinatorAccount120001009
1004JoellyAccount CoordinatorAccount150001009
1006CarmiaeAccount CoordinatorAccount150001009
1005BobAccountant IIAccount200001009
1008GallyManagerAccount280001009
1010SofiaSales CoordnitorSales200001010
1002GordonMarketing ManagerSales250001010
1003GracieAssistant ManagerSales250001010
1007CellieAssistant ManagerSales250001010
1001JervisDirector of SalesSales300001010

LAST_VALUE

In reverse of FIRST_VALUE, it return the value from the last row in a window based on the clause and assigned to all the rows of the same group:

IDFIRST_NAMEDESIGNATIONDEPARTMENTSALARYLAST_VALUE
1009RichardAccount CoordinatorAccount120001009
1004JoellyAccount CoordinatorAccount150001006
1006CarmiaeAccount CoordinatorAccount150001006
1005BobAccountant IIAccount200001005
1008GallyManagerAccount280001008
1010SofiaSales CoordnitorSales200001010
1002GordonMarketing ManagerSales250001007
1003GracieAssistant ManagerSales250001007
1007CellieAssistant ManagerSales250001007
1001JervisDirector of SalesSales300001001

Here, the first row has been assigned value 1009 but for the next two rows have assigned 1006 as both ids 1004 and 1006 have same salary 15000.

Wrapping Up

In this post, we have gone through the windowing function in Hive. We have used all the windowing function including LEAD, LAG, FIRST_VALUE, and LAST_VALUE and have seen how we can apply these function in the query. In addition to this, we checked the windowing specification with different lower and upper boundary expression.

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply