Requirement

In this post, we are going to explore analytics functions in Hive. These are the following analytics function available in the hive:

ROW_NUMBER
RANK
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

analytics functions in hive

Component Involved

Hive

Sample Data

ID	FIRST_NAME	LAST_NAME	DESIGNATION	DEPARTMENT	SALARY
1001	Jervis	Roll	Director of Sales	Sales	30000
1002	Gordon	Mattster	Marketing Manager	Sales	25000
1003	Gracie	Fronllen	Assistant Manager	Sales	25000
1004	Joelly	Wellback	Account Coordinator	Account	15000
1005	Bob	Havock	Accountant II	Account	20000
1006	Carmiae	Courage	Account Coordinator	Account	15000
1007	Cellie	Trevaskiss	Assistant Manager	Sales	25000
1008	Gally	Johnson	Manager	Account	28000
1009	Richard	Grill	Account Coordinator	Account	12000
1010	Sofia	Ketty	Sales Coordinator	Sales	20000

Solution

We have a table named emp_dept_tbl which is having above sample data. Our task is to rank the records based on the department and salary wise. We will do this using three analytics functions available in the hive – ROW NUMBER, RANK and DENSE RANK. In addition to this, we will also check other analytics functions like CUME_DIST, NTILE, and PERCENT_RANK.

Step 1: Data in Hive Table

Let’s check the table and data which we will use in the next steps.

Step 2: Ranking Records Using Analytics Function

In this step, we are going to rank each record based on the department with the salary of an employee in each department in descending order.

ROW NUMBER

This is an analytics function which represents each row with a unique value(1,2,3,….) based on the column value used in OVER clause. Here, we are having 10 rows and will use ROW_NUMBER on these records.

 SELECT department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) 
as row_num FROM emp_dept_tbl;

DEPT	SALARY	ROW_NUM
Account	28000	1
Account	20000	2
Account	15000	3
Account	15000	4
Account	12000	5
Sales	30000	1
Sales	25000	2
Sales	25000	3
Sales	25000	4
Sales	20000	5

Here, each row in each department has been assigned a unique value.

RANK

The RANK analytics function is used to assign a rank to the rows based on the column values in OVER clause. The row with equal values assigned the same rank with next rank value skipped.

 SELECT department, salary, RANK() OVER(PARTITION BY department ORDER BY salary DESC) as rnk FROM emp_dept_tbl;

DEPARTMENT	SALARY	RNK
Account	28000	1
Account	20000	2
Account	15000	3
Account	15000	3
Account	12000	5
Sales	30000	1
Sales	25000	2
Sales	25000	2
Sales	25000	2
Sales	20000	5

Here, if you see all rows with salary 15000 in department account and 25000 in department sales having the same rank as 3 and 2 respectively and next row with next salary have rank 5 with next rank value skipped.

DENSE_RANK

The DENSE_RANK analytics function in hive used to assign a rank to each row. The rows with equal values receive the same rank and this rank assigned in the sequential order so that no rank values are skipped.

 SELECT department, salary, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk FROM emp_dept_tbl;

DEPT	SALARY	DNS_RNK
Account	28000	1
Account	20000	2
Account	15000	3
Account	15000	3
Account	12000	4
Sales	30000	1
Sales	25000	2
Sales	25000	2
Sales	25000	2
Sales	20000	3

Here, the rank value has not been skipped like in RANK function.

Execute all three in a single query:

 SELECT department as dept, salary as sal, 
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk,
DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk 
FROM emp_dept_tbl;

DEPT	SALARY	ROW_NUM	RNK	DNS_RNK
Account	28000	1	1	1
Account	20000	2	2	2
Account	15000	3	3	3
Account	15000	4	3	3
Account	12000	5	5	4
Sales	30000	1	1	1
Sales	25000	2	2	2
Sales	25000	3	2	2
Sales	25000	4	2	2
Sales	20000	5	5	3

CUME_DIST

This function stands for cumulative distribution. It computes the relative position of a column value in a group. Here, we can calculate the cumulative distribution of salaries among all departments. For a row, the cumulative distribution of salary is calculated as:

Cum_dist(salary) = Number of rows with the value lower than or equals to salary / total number of rows in the dataset

 SELECT department, salary, CUME_DIST() OVER (ORDER BY salary) as cum_dist FROM emp_dept_tbl;

DEPT	SALARY	CUME_DIST
Account	12000	0.1
Account	15000	0.3
Account	15000	0.3
Account	20000	0.5
Sales	20000	0.5
Sales	25000	0.8
Sales	25000	0.8
Sales	25000	0.8
Account	28000	0.9
Sales	30000	1.0

Here, if we take salary 15000 and 20000, the cumulative distributions are 0.3 and 0.5 respectively that means 30% employee’s salaries are less than or equal to 15000 and 50% employee’s salaries are less than or equal to 20000.

PERCENT_RANK

It is very similar to the CUME_DIST function. It ranks the row as a percentage. The first row in any dataset has percent_rank 0 and the return value is of the double type.

Let’s rank the salary department wise as percentage:

 SELECT department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as perc_rnk FROM emp_dept_tbl;

DEPT	SALARY	RNK	PERC_RANK
Account	28000	1	0
Account	20000	2	0.25
Account	15000	3	0.5
Account	15000	3	0.5
Account	12000	5	1
Sales	30000	1	0
Sales	25000	2	0.25
Sales	25000	2	0.25
Sales	25000	2	0.25
Sales	20000	5	1

Here, first it is getting the rank of each row in the dataset and then calculating the percentage rank using below formula:

Percent_Rank = (rank decreased by 1)/(remaining rows in the group)

For example, if we take the salary 15000 from Account and 20000 from the Sales department. The rank of these two rows are 3 and 5 respectively and Total remaining rows in each department is 4, then percent rank would be calculated as:

(3-1)/4 = 0.5 and (5-1)/4 = 1.

Note: We are decreasing by 1 as the first row in each group will start from value 0.

NTILE

It divides the number of rows in a partition into a specific number of ranked groups (bucket) as equally as possible. It returns a bucket member associated with it.

 SELECT department, salary, NTILE(4) OVER (PARTITION BY department ORDER BY salary DESC) as ntile FROM emp_dept_tbl;

DEPT	SALARY	NTILE
Account	28000	1
Account	20000	1
Account	15000	2
Account	15000	3
Account	12000	4
Sales	30000	1
Sales	25000	1
Sales	25000	2
Sales	25000	3
Sales	20000	4

Here, we have divided the dataset into 4 buckets.

Remainder = (total records in each partition)/NTILES

Remainder = 5%4 = 1

So, the extra 1 value (the remainder of 5%4) is allocated to bucket 1, which therefore have one more value than rest of the buckets(2, 3, & 4).

Wrapping Up

In this post, we have discussed the analytics functions in Hive. We have seen how can we take top records based on the criteria from the data set, distributed the dataset into the ranked group. The ROW_NUMBER, RANK, and DENSE_RANK are the most used analytics functions inthe hive.

You can also read Windowing Functions in Hive

Analytics Functions in Hive