Requirement
In this post, we are going to explore analytics functions in Hive. These are the following analytics function available in the hive:
- ROW_NUMBER
- RANK
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
Component Involved
- Hive
Sample Data
ID | FIRST_NAME | LAST_NAME | DESIGNATION | DEPARTMENT | SALARY |
1001 | Jervis | Roll | Director of Sales | Sales | 30000 |
1002 | Gordon | Mattster | Marketing Manager | Sales | 25000 |
1003 | Gracie | Fronllen | Assistant Manager | Sales | 25000 |
1004 | Joelly | Wellback | Account Coordinator | Account | 15000 |
1005 | Bob | Havock | Accountant II | Account | 20000 |
1006 | Carmiae | Courage | Account Coordinator | Account | 15000 |
1007 | Cellie | Trevaskiss | Assistant Manager | Sales | 25000 |
1008 | Gally | Johnson | Manager | Account | 28000 |
1009 | Richard | Grill | Account Coordinator | Account | 12000 |
1010 | Sofia | Ketty | Sales Coordinator | Sales | 20000 |
Solution
We have a table named emp_dept_tbl which is having above sample data. Our task is to rank the records based on the department and salary wise. We will do this using three analytics functions available in the hive – ROW NUMBER, RANK and DENSE RANK. In addition to this, we will also check other analytics functions like CUME_DIST, NTILE, and PERCENT_RANK.
Step 1: Data in Hive Table
Let’s check the table and data which we will use in the next steps.
Step 2: Ranking Records Using Analytics Function
In this step, we are going to rank each record based on the department with the salary of an employee in each department in descending order.
ROW NUMBER
This is an analytics function which represents each row with a unique value(1,2,3,….) based on the column value used in OVER clause. Here, we are having 10 rows and will use ROW_NUMBER on these records.
SELECT department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num FROM emp_dept_tbl;
DEPT | SALARY | ROW_NUM |
Account | 28000 | 1 |
Account | 20000 | 2 |
Account | 15000 | 3 |
Account | 15000 | 4 |
Account | 12000 | 5 |
Sales | 30000 | 1 |
Sales | 25000 | 2 |
Sales | 25000 | 3 |
Sales | 25000 | 4 |
Sales | 20000 | 5 |
Here, each row in each department has been assigned a unique value.
RANK
The RANK analytics function is used to assign a rank to the rows based on the column values in OVER clause. The row with equal values assigned the same rank with next rank value skipped.
SELECT department, salary, RANK() OVER(PARTITION BY department ORDER BY salary DESC) as rnk FROM emp_dept_tbl;
DEPARTMENT | SALARY | RNK |
Account | 28000 | 1 |
Account | 20000 | 2 |
Account | 15000 | 3 |
Account | 15000 | 3 |
Account | 12000 | 5 |
Sales | 30000 | 1 |
Sales | 25000 | 2 |
Sales | 25000 | 2 |
Sales | 25000 | 2 |
Sales | 20000 | 5 |
Here, if you see all rows with salary 15000 in department account and 25000 in department sales having the same rank as 3 and 2 respectively and next row with next salary have rank 5 with next rank value skipped.
DENSE_RANK
The DENSE_RANK analytics function in hive used to assign a rank to each row. The rows with equal values receive the same rank and this rank assigned in the sequential order so that no rank values are skipped.
SELECT department, salary, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk FROM emp_dept_tbl;
DEPT | SALARY | DNS_RNK |
Account | 28000 | 1 |
Account | 20000 | 2 |
Account | 15000 | 3 |
Account | 15000 | 3 |
Account | 12000 | 4 |
Sales | 30000 | 1 |
Sales | 25000 | 2 |
Sales | 25000 | 2 |
Sales | 25000 | 2 |
Sales | 20000 | 3 |
Here, the rank value has not been skipped like in RANK function.
Execute all three in a single query:
SELECT department as dept, salary as sal, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num, RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk FROM emp_dept_tbl;
DEPT | SALARY | ROW_NUM | RNK | DNS_RNK |
Account | 28000 | 1 | 1 | 1 |
Account | 20000 | 2 | 2 | 2 |
Account | 15000 | 3 | 3 | 3 |
Account | 15000 | 4 | 3 | 3 |
Account | 12000 | 5 | 5 | 4 |
Sales | 30000 | 1 | 1 | 1 |
Sales | 25000 | 2 | 2 | 2 |
Sales | 25000 | 3 | 2 | 2 |
Sales | 25000 | 4 | 2 | 2 |
Sales | 20000 | 5 | 5 | 3 |
CUME_DIST
This function stands for cumulative distribution. It computes the relative position of a column value in a group. Here, we can calculate the cumulative distribution of salaries among all departments. For a row, the cumulative distribution of salary is calculated as:
Cum_dist(salary) = Number of rows with the value lower than or equals to salary / total number of rows in the dataset
SELECT department, salary, CUME_DIST() OVER (ORDER BY salary) as cum_dist FROM emp_dept_tbl;
DEPT | SALARY | CUME_DIST |
Account | 12000 | 0.1 |
Account | 15000 | 0.3 |
Account | 15000 | 0.3 |
Account | 20000 | 0.5 |
Sales | 20000 | 0.5 |
Sales | 25000 | 0.8 |
Sales | 25000 | 0.8 |
Sales | 25000 | 0.8 |
Account | 28000 | 0.9 |
Sales | 30000 | 1.0 |
Here, if we take salary 15000 and 20000, the cumulative distributions are 0.3 and 0.5 respectively that means 30% employee’s salaries are less than or equal to 15000 and 50% employee’s salaries are less than or equal to 20000.
PERCENT_RANK
It is very similar to the CUME_DIST function. It ranks the row as a percentage. The first row in any dataset has percent_rank 0 and the return value is of the double type.
Let’s rank the salary department wise as percentage:
SELECT department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as perc_rnk FROM emp_dept_tbl;
DEPT | SALARY | RNK | PERC_RANK |
Account | 28000 | 1 | 0 |
Account | 20000 | 2 | 0.25 |
Account | 15000 | 3 | 0.5 |
Account | 15000 | 3 | 0.5 |
Account | 12000 | 5 | 1 |
Sales | 30000 | 1 | 0 |
Sales | 25000 | 2 | 0.25 |
Sales | 25000 | 2 | 0.25 |
Sales | 25000 | 2 | 0.25 |
Sales | 20000 | 5 | 1 |
Here, first it is getting the rank of each row in the dataset and then calculating the percentage rank using below formula:
Percent_Rank = (rank decreased by 1)/(remaining rows in the group)
For example, if we take the salary 15000 from Account and 20000 from the Sales department. The rank of these two rows are 3 and 5 respectively and Total remaining rows in each department is 4, then percent rank would be calculated as:
(3-1)/4 = 0.5 and (5-1)/4 = 1.
Note: We are decreasing by 1 as the first row in each group will start from value 0.
NTILE
It divides the number of rows in a partition into a specific number of ranked groups (bucket) as equally as possible. It returns a bucket member associated with it.
SELECT department, salary, NTILE(4) OVER (PARTITION BY department ORDER BY salary DESC) as ntile FROM emp_dept_tbl;
DEPT | SALARY | NTILE |
Account | 28000 | 1 |
Account | 20000 | 1 |
Account | 15000 | 2 |
Account | 15000 | 3 |
Account | 12000 | 4 |
Sales | 30000 | 1 |
Sales | 25000 | 1 |
Sales | 25000 | 2 |
Sales | 25000 | 3 |
Sales | 20000 | 4 |
Here, we have divided the dataset into 4 buckets.
Remainder = (total records in each partition)/NTILES
Remainder = 5%4 = 1
So, the extra 1 value (the remainder of 5%4) is allocated to bucket 1, which therefore have one more value than rest of the buckets(2, 3, & 4).
Wrapping Up
In this post, we have discussed the analytics functions in Hive. We have seen how can we take top records based on the criteria from the data set, distributed the dataset into the ranked group. The ROW_NUMBER, RANK, and DENSE_RANK are the most used analytics functions inthe hive.
You can also read Windowing Functions in Hive