Requirement

In this post, we are going to explore windowing functions in Hive. These are the windowing functions:

LEAD
LAD
FIRST_VALUE
LAST_VALUE
MIN/MAX/COUNT/AVG OVER Clause

Component Involved

Hive

Sample Data

ID	FIRST_NAME	LAST_NAME	DESIGNATION	DEPARTMENT	SALARY
1001	Jervis	Roll	Director of Sales	Sales	30000
1002	Gordon	Mattster	Marketing Manager	Sales	25000
1003	Gracie	Fronllen	Assistant Manager	Sales	25000
1004	Joelly	Wellback	Account Coordinator	Account	15000
1005	Bob	Havock	Accountant II	Account	20000
1006	Carmiae	Courage	Account Coordinator	Account	15000
1007	Cellie	Trevaskiss	Assistant Manager	Sales	25000
1008	Gally	Johnson	Manager	Account	28000
1009	Richard	Grill	Account Coordinator	Account	12000
1010	Sofia	Ketty	Sales Coordinator	Sales	20000

Solution

Let’s first discuss what is windowing?

Windowing allows features to create a window on the set of data in order to operate aggregation like COUNT, AVG, MIN, MAX and other analytical functions such as LEAD, LAG, FIRST_VALUE, and LAST_VALUE.

The syntax of the query with windows:

SELECT <columns_name>, <aggregate>(column_name) OVER (<windowing specification>) FROM <table_name>;

where,

column_name – column name of the table

Aggregate – Any aggregate function(s) like COUNT, AVG, MIN, MAX

Windowing specification – It includes following:

PARTITION BY – Takes a column(s) of the table as a reference.
ORDER BY – Specified the Order of column(s) either Ascending or Descending.
Frame – Specified the boundary of the frame by stat and end value. The boundary either be a type of RANGE or ROW followed by PRECEDING, FOLLOWING and any value.

These three (PARTITION, ORDER BY, and Window frame) are either be alone or together.

PARTITION BY

Count Employees in each department

 SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl;

This above output has employees count for each department. Here, getting duplicate records. In order to get the only distinct record, we can use the query like below:

 SELECT DISTINCT * FROM (SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl) A;

ORDER BY

Case I: Without PARTITION

Count Employee with salary descending order

 SELECT id, department, salary, COUNT(id) OVER (ORDER BY salary DESC) FROM emp_dept_tbl;

Case II: With PARTITION

Count Employees of each department order by salary

 SELECT id, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC) FROM emp_dept_tbl;

WINDOWING Specification

In the windowing frame, you can define the subset of rows in which the windowing function will work. You can specify this subset using upper and lower boundary value using windowing specification.

The syntax to defined windowing specification with ROW/RANGE looks like:

ROW|RANGE BETWEEN <upper expression> AND <lower expression>

Here,

UPPER EXPRESSION can have these 3 value:

UNBOUNDED PRECEDING – It denotes window will start from the first row of the group/partition.
CURRENT ROW – Window will start from the current row.
<INTEGER VALUE> PRECEDING – Provide any specific row to start window

LOWER EXPRESSION

UNBOUNDED FOLLOWING – It means the window will end at the last row of the group/partition.
CURRENT ROW – Window will end at the current row
<INTEGER VALUE> FOLLOWING – Window will end at specific row

Now, let’s use these different upper and lower expression as a combination on different cases and check how it is working.

UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

Case I: Count Employees with windowing specification

 SELECT id, first_name, designation, department, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

It assigned final count of record to each record i.e. 10.

Case II: Count Employees with PARTITION BY department and windowing specification

 SELECT id, first_name, designation, department, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

Here, it has counted the total no. of rows based on the partitioned by the department and assigned the value to each row.

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 SELECT id, first_name, designation, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

The output seems similar to the previous one except for the order of row based on the salary descending order.

UNBOUNDED PRECEDING and CURRENT ROW

Case I: Count Employees with windowing specification only

 SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the row has been assigned its current position to the count value without any specific order.

Case II: Count Employees with PARTITION BY department and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the row has started and gone till the end of each department that’s why Account and Sales department’s rows having count value 1 to 5 respectively.

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

Here, the output seems similar to the previous one except for the order of each row.

CURRENT ROW AND UNBOUNDED FOLLOWING

Case I: Count Employees with windowing specification only

 SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) FROM emp_dept_tbl;

Case II: Count Employees with PARTITION BY department and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

CURRENT ROW AND 3 FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) FROM emp_dept_tbl;

3 PRECEDING AND 3 FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;

3 PRECEDING AND CURRENT ROW

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;

3 PRECEDING AND UNBOUNDED FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;

UNBOUNDED PRECEDING AND 3 FOLLOWING

 SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;

LEAD

It is an analytics function used to return the data from the next set of rows. By default, the lead is of 1 row and it will return NULL in case it exceeds the current window.

 SELECT id, first_name, designation, department,salary, LEAD(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;

ID	FIRST_NAME	DESIGNATION	DEPARTMENT	SALARY	LEAD
1009	Richard	Account Coordinator	Account	12000	1004
1004	Joelly	Account Coordinator	Account	15000	1006
1006	Carmiae	Account Coordinator	Account	15000	1005
1005	Bob	Accountant II	Account	20000	1008
1008	Gally	Manager	Account	28000	NULL
1010	Sofia	Sales Coordnitor	Sales	20000	1002
1002	Gordon	Marketing Manager	Sales	25000	1003
1003	Gracie	Assistant Manager	Sales	25000	1007
1007	Cellie	Assistant Manager	Sales	25000	1001
1001	Jervis	Director of Sales	Sales	30000	NULL

LAG

It is the opposite of LEAD function, it returns the data from the previous set of data. By default lag is of 1 row and return NULL in case the lag for the current row is exceeded before the beginning of the window:

 SELECT id, first_name, designation, department,salary, LAG(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;

ID	FIRST_NAME	DESIGNATION	DEPARTMENT	SALARY	LAG
1009	Richard	Account Coordinator	Account	12000	NULL
1004	Joelly	Account Coordinator	Account	15000	1009
1006	Carmiae	Account Coordinator	Account	15000	1004
1005	Bob	Accountant II	Account	20000	1006
1008	Gally	Manager	Account	28000	1005
1010	Sofia	Sales Coordnitor	Sales	20000	NULL
1002	Gordon	Marketing Manager	Sales	25000	1010
1003	Gracie	Assistant Manager	Sales	25000	1002
1007	Cellie	Assistant Manager	Sales	25000	1003
1001	Jervis	Director of Sales	Sales	30000	1007

FIRST_VALUE

This function returns the value from the first row in the window based on the clause and assigned to all the rows of the same group:

 SELECT id, first_name, designation, department,salary, FIRST_VALUE(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;

ID	FIRST_NAME	DESIGNATION	DEPARTMENT	SALARY	FIRST_VALUE
1009	Richard	Account Coordinator	Account	12000	1009
1004	Joelly	Account Coordinator	Account	15000	1009
1006	Carmiae	Account Coordinator	Account	15000	1009
1005	Bob	Accountant II	Account	20000	1009
1008	Gally	Manager	Account	28000	1009
1010	Sofia	Sales Coordnitor	Sales	20000	1010
1002	Gordon	Marketing Manager	Sales	25000	1010
1003	Gracie	Assistant Manager	Sales	25000	1010
1007	Cellie	Assistant Manager	Sales	25000	1010
1001	Jervis	Director of Sales	Sales	30000	1010

LAST_VALUE

In reverse of FIRST_VALUE, it return the value from the last row in a window based on the clause and assigned to all the rows of the same group:

ID	FIRST_NAME	DESIGNATION	DEPARTMENT	SALARY	LAST_VALUE
1009	Richard	Account Coordinator	Account	12000	1009
1004	Joelly	Account Coordinator	Account	15000	1006
1006	Carmiae	Account Coordinator	Account	15000	1006
1005	Bob	Accountant II	Account	20000	1005
1008	Gally	Manager	Account	28000	1008
1010	Sofia	Sales Coordnitor	Sales	20000	1010
1002	Gordon	Marketing Manager	Sales	25000	1007
1003	Gracie	Assistant Manager	Sales	25000	1007
1007	Cellie	Assistant Manager	Sales	25000	1007
1001	Jervis	Director of Sales	Sales	30000	1001

Here, the first row has been assigned value 1009 but for the next two rows have assigned 1006 as both ids 1004 and 1006 have same salary 15000.

Wrapping Up

In this post, we have gone through the windowing function in Hive. We have used all the windowing function including LEAD, LAG, FIRST_VALUE, and LAST_VALUE and have seen how we can apply these function in the query. In addition to this, we checked the windowing specification with different lower and upper boundary expression.

Windowing Functions in Hive

Requirement

Component Involved

Sample Data

Solution

PARTITION BY

ORDER BY

WINDOWING Specification

LEAD

LAG

FIRST_VALUE

LAST_VALUE

Wrapping Up

Leave a Reply Cancel reply

Tags