Requirement
In this post, we are going to explore windowing functions in Hive. These are the windowing functions:
- LEAD
- LAD
- FIRST_VALUE
- LAST_VALUE
- MIN/MAX/COUNT/AVG OVER Clause
Component Involved
- Hive
Sample Data
ID | FIRST_NAME | LAST_NAME | DESIGNATION | DEPARTMENT | SALARY |
1001 | Jervis | Roll | Director of Sales | Sales | 30000 |
1002 | Gordon | Mattster | Marketing Manager | Sales | 25000 |
1003 | Gracie | Fronllen | Assistant Manager | Sales | 25000 |
1004 | Joelly | Wellback | Account Coordinator | Account | 15000 |
1005 | Bob | Havock | Accountant II | Account | 20000 |
1006 | Carmiae | Courage | Account Coordinator | Account | 15000 |
1007 | Cellie | Trevaskiss | Assistant Manager | Sales | 25000 |
1008 | Gally | Johnson | Manager | Account | 28000 |
1009 | Richard | Grill | Account Coordinator | Account | 12000 |
1010 | Sofia | Ketty | Sales Coordinator | Sales | 20000 |
Solution
Let’s first discuss what is windowing?
Windowing allows features to create a window on the set of data in order to operate aggregation like COUNT, AVG, MIN, MAX and other analytical functions such as LEAD, LAG, FIRST_VALUE, and LAST_VALUE.
The syntax of the query with windows:
SELECT <columns_name>, <aggregate>(column_name) OVER (<windowing specification>) FROM <table_name>;
where,
column_name – column name of the table
Aggregate – Any aggregate function(s) like COUNT, AVG, MIN, MAX
Windowing specification – It includes following:
- PARTITION BY – Takes a column(s) of the table as a reference.
- ORDER BY – Specified the Order of column(s) either Ascending or Descending.
- Frame – Specified the boundary of the frame by stat and end value. The boundary either be a type of RANGE or ROW followed by PRECEDING, FOLLOWING and any value.
These three (PARTITION, ORDER BY, and Window frame) are either be alone or together.
PARTITION BY
Count Employees in each department
SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl;
This above output has employees count for each department. Here, getting duplicate records. In order to get the only distinct record, we can use the query like below:
SELECT DISTINCT * FROM (SELECT department, COUNT(id) OVER (PARTITION BY department) FROM emp_dept_tbl) A;
ORDER BY
Case I: Without PARTITION
Count Employee with salary descending order
SELECT id, department, salary, COUNT(id) OVER (ORDER BY salary DESC) FROM emp_dept_tbl;
Case II: With PARTITION
Count Employees of each department order by salary
SELECT id, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC) FROM emp_dept_tbl;
WINDOWING Specification
In the windowing frame, you can define the subset of rows in which the windowing function will work. You can specify this subset using upper and lower boundary value using windowing specification.
The syntax to defined windowing specification with ROW/RANGE looks like:
ROW|RANGE BETWEEN <upper expression> AND <lower expression>
Here,
UPPER EXPRESSION can have these 3 value:
- UNBOUNDED PRECEDING – It denotes window will start from the first row of the group/partition.
- CURRENT ROW – Window will start from the current row.
- <INTEGER VALUE> PRECEDING – Provide any specific row to start window
LOWER EXPRESSION
- UNBOUNDED FOLLOWING – It means the window will end at the last row of the group/partition.
- CURRENT ROW – Window will end at the current row
- <INTEGER VALUE> FOLLOWING – Window will end at specific row
Now, let’s use these different upper and lower expression as a combination on different cases and check how it is working.
UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
Case I: Count Employees with windowing specification
SELECT id, first_name, designation, department, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;
It assigned final count of record to each record i.e. 10.
Case II: Count Employees with PARTITION BY department and windowing specification
SELECT id, first_name, designation, department, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;
Here, it has counted the total no. of rows based on the partitioned by the department and assigned the value to each row.
Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification
SELECT id, first_name, designation, department, salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;
The output seems similar to the previous one except for the order of row based on the salary descending order.
UNBOUNDED PRECEDING and CURRENT ROW
Case I: Count Employees with windowing specification only
SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;
Here, the row has been assigned its current position to the count value without any specific order.
Case II: Count Employees with PARTITION BY department and windowing specification
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;
Here, the row has started and gone till the end of each department that’s why Account and Sales department’s rows having count value 1 to 5 respectively.
Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;
Here, the output seems similar to the previous one except for the order of each row.
CURRENT ROW AND UNBOUNDED FOLLOWING
Case I: Count Employees with windowing specification only
SELECT id, first_name, designation, department, salary, COUNT(id) OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) FROM emp_dept_tbl;
Case II: Count Employees with PARTITION BY department and windowing specification
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;
Case III: Count Employees with PARTITION BY department and ORDER BY salary DESC and windowing specification
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;
CURRENT ROW AND 3 FOLLOWING
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) FROM emp_dept_tbl;
3 PRECEDING AND 3 FOLLOWING
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;
3 PRECEDING AND CURRENT ROW
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) FROM emp_dept_tbl;
3 PRECEDING AND UNBOUNDED FOLLOWING
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN 3 PRECEDING AND UNBOUNDED FOLLOWING) FROM emp_dept_tbl;
UNBOUNDED PRECEDING AND 3 FOLLOWING
SELECT id, first_name, designation, department,salary, COUNT(id) OVER (PARTITION BY department ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 3 FOLLOWING) FROM emp_dept_tbl;
LEAD
It is an analytics function used to return the data from the next set of rows. By default, the lead is of 1 row and it will return NULL in case it exceeds the current window.
SELECT id, first_name, designation, department,salary, LEAD(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
ID | FIRST_NAME | DESIGNATION | DEPARTMENT | SALARY | LEAD |
1009 | Richard | Account Coordinator | Account | 12000 | 1004 |
1004 | Joelly | Account Coordinator | Account | 15000 | 1006 |
1006 | Carmiae | Account Coordinator | Account | 15000 | 1005 |
1005 | Bob | Accountant II | Account | 20000 | 1008 |
1008 | Gally | Manager | Account | 28000 | NULL |
1010 | Sofia | Sales Coordnitor | Sales | 20000 | 1002 |
1002 | Gordon | Marketing Manager | Sales | 25000 | 1003 |
1003 | Gracie | Assistant Manager | Sales | 25000 | 1007 |
1007 | Cellie | Assistant Manager | Sales | 25000 | 1001 |
1001 | Jervis | Director of Sales | Sales | 30000 | NULL |
LAG
It is the opposite of LEAD function, it returns the data from the previous set of data. By default lag is of 1 row and return NULL in case the lag for the current row is exceeded before the beginning of the window:
SELECT id, first_name, designation, department,salary, LAG(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
ID | FIRST_NAME | DESIGNATION | DEPARTMENT | SALARY | LAG |
1009 | Richard | Account Coordinator | Account | 12000 | NULL |
1004 | Joelly | Account Coordinator | Account | 15000 | 1009 |
1006 | Carmiae | Account Coordinator | Account | 15000 | 1004 |
1005 | Bob | Accountant II | Account | 20000 | 1006 |
1008 | Gally | Manager | Account | 28000 | 1005 |
1010 | Sofia | Sales Coordnitor | Sales | 20000 | NULL |
1002 | Gordon | Marketing Manager | Sales | 25000 | 1010 |
1003 | Gracie | Assistant Manager | Sales | 25000 | 1002 |
1007 | Cellie | Assistant Manager | Sales | 25000 | 1003 |
1001 | Jervis | Director of Sales | Sales | 30000 | 1007 |
FIRST_VALUE
This function returns the value from the first row in the window based on the clause and assigned to all the rows of the same group:
SELECT id, first_name, designation, department,salary, FIRST_VALUE(id) OVER (PARTITION BY department ORDER BY salary) FROM emp_dept_tbl;
ID | FIRST_NAME | DESIGNATION | DEPARTMENT | SALARY | FIRST_VALUE |
1009 | Richard | Account Coordinator | Account | 12000 | 1009 |
1004 | Joelly | Account Coordinator | Account | 15000 | 1009 |
1006 | Carmiae | Account Coordinator | Account | 15000 | 1009 |
1005 | Bob | Accountant II | Account | 20000 | 1009 |
1008 | Gally | Manager | Account | 28000 | 1009 |
1010 | Sofia | Sales Coordnitor | Sales | 20000 | 1010 |
1002 | Gordon | Marketing Manager | Sales | 25000 | 1010 |
1003 | Gracie | Assistant Manager | Sales | 25000 | 1010 |
1007 | Cellie | Assistant Manager | Sales | 25000 | 1010 |
1001 | Jervis | Director of Sales | Sales | 30000 | 1010 |
LAST_VALUE
In reverse of FIRST_VALUE, it return the value from the last row in a window based on the clause and assigned to all the rows of the same group:
ID | FIRST_NAME | DESIGNATION | DEPARTMENT | SALARY | LAST_VALUE |
1009 | Richard | Account Coordinator | Account | 12000 | 1009 |
1004 | Joelly | Account Coordinator | Account | 15000 | 1006 |
1006 | Carmiae | Account Coordinator | Account | 15000 | 1006 |
1005 | Bob | Accountant II | Account | 20000 | 1005 |
1008 | Gally | Manager | Account | 28000 | 1008 |
1010 | Sofia | Sales Coordnitor | Sales | 20000 | 1010 |
1002 | Gordon | Marketing Manager | Sales | 25000 | 1007 |
1003 | Gracie | Assistant Manager | Sales | 25000 | 1007 |
1007 | Cellie | Assistant Manager | Sales | 25000 | 1007 |
1001 | Jervis | Director of Sales | Sales | 30000 | 1001 |
Here, the first row has been assigned value 1009 but for the next two rows have assigned 1006 as both ids 1004 and 1006 have same salary 15000.
Wrapping Up
In this post, we have gone through the windowing function in Hive. We have used all the windowing function including LEAD, LAG, FIRST_VALUE, and LAST_VALUE and have seen how we can apply these function in the query. In addition to this, we checked the windowing specification with different lower and upper boundary expression.