Analytics Functions in Hive

Requirement

In this post, we are going to explore analytics functions in Hive. These are the following analytics function available in the hive:

  • ROW_NUMBER
  • RANK
  • DENSE_RANK
  • CUME_DIST
  • PERCENT_RANK
  • NTILE

analytics functions in hive

Component Involved

  • Hive

Sample Data

ID

FIRST_NAME

LAST_NAME

DESIGNATION

DEPARTMENT

SALARY

1001

Jervis

Roll

Director of Sales

Sales

30000

1002

Gordon

Mattster

Marketing Manager

Sales

25000

1003

Gracie

Fronllen

Assistant Manager

Sales

25000

1004

Joelly

Wellback

Account Coordinator

Account

15000

1005

Bob

Havock

Accountant II

Account

20000

1006

Carmiae

Courage

Account Coordinator

Account

15000

1007

Cellie

Trevaskiss

Assistant Manager

Sales

25000

1008

Gally

Johnson

Manager

Account

28000

1009

Richard

Grill

Account Coordinator

Account

12000

1010

Sofia

Ketty

Sales Coordinator

Sales

20000

Solution

We have a table named emp_dept_tbl which is having above sample data. Our task is to rank the records based on the department and salary wise. We will do this using three analytics functions available in the hive – ROW NUMBER, RANK and DENSE RANK. In addition to this, we will also check other analytics functions like CUME_DIST, NTILE, and PERCENT_RANK.

Step 1: Data in Hive Table

Let’s check the table and data which we will use in the next steps.

Step 2: Ranking Records Using Analytics Function

In this step, we are going to rank each record based on the department with the salary of an employee in each department in descending order.

ROW NUMBER

This is an analytics function which represents each row with a unique value(1,2,3,….) based on the column value used in OVER clause. Here, we are having 10 rows and will use ROW_NUMBER on these records.

 SELECT department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) 
as row_num FROM emp_dept_tbl;
DEPTSALARYROW_NUM
Account280001
Account200002
Account150003
Account150004
Account120005
Sales300001
Sales250002
Sales250003
Sales250004
Sales200005

Here, each row in each department has been assigned a unique value.

RANK

The RANK analytics function is used to assign a rank to the rows based on the column values in OVER clause. The row with equal values assigned the same rank with next rank value skipped.

 SELECT department, salary, RANK() OVER(PARTITION BY department ORDER BY salary DESC) as rnk FROM emp_dept_tbl;
DEPARTMENTSALARYRNK
Account280001
Account200002
Account150003
Account150003
Account120005
Sales300001
Sales250002
Sales250002
Sales250002
Sales200005

Here, if you see all rows with salary 15000 in department account and 25000 in department sales having the same rank as 3 and 2 respectively and next row with next salary have rank 5 with next rank value skipped.

DENSE_RANK

The DENSE_RANK analytics function in hive used to assign a rank to each row. The rows with equal values receive the same rank and this rank assigned in the sequential order so that no rank values are skipped.

 SELECT department, salary, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk FROM emp_dept_tbl;
DEPTSALARYDNS_RNK
Account280001
Account200002
Account150003
Account150003
Account120004
Sales300001
Sales250002
Sales250002
Sales250002
Sales200003

Here, the rank value has not been skipped like in RANK function.

Execute all three in a single query:

 SELECT department as dept, salary as sal, 
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk,
DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk 
FROM emp_dept_tbl;
DEPTSALARYROW_NUMRNKDNS_RNK
Account28000111
Account20000222
Account15000333
Account15000433
Account12000554
Sales30000111
Sales25000222
Sales25000322
Sales25000422
Sales20000553

CUME_DIST

This function stands for cumulative distribution. It computes the relative position of a column value in a group. Here, we can calculate the cumulative distribution of salaries among all departments. For a row, the cumulative distribution of salary is calculated as:

Cum_dist(salary) = Number of rows with the value lower than or equals to salary / total number of rows in the dataset

 SELECT department, salary, CUME_DIST() OVER (ORDER BY salary) as cum_dist FROM emp_dept_tbl;
DEPTSALARYCUME_DIST
Account120000.1
Account150000.3
Account150000.3
Account200000.5
Sales200000.5
Sales250000.8
Sales250000.8
Sales250000.8
Account280000.9
Sales300001.0

Here, if we take salary 15000 and 20000, the cumulative distributions are 0.3 and 0.5 respectively that means 30% employee’s salaries are less than or equal to 15000 and 50% employee’s salaries are less than or equal to 20000.

PERCENT_RANK

It is very similar to the CUME_DIST function. It ranks the row as a percentage. The first row in any dataset has percent_rank 0 and the return value is of the double type.

Let’s rank the salary department wise as percentage:

 SELECT department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as perc_rnk FROM emp_dept_tbl;
DEPTSALARYRNKPERC_RANK
Account2800010
Account2000020.25
Account1500030.5
Account1500030.5
Account1200051
Sales3000010
Sales2500020.25
Sales2500020.25
Sales2500020.25
Sales2000051

Here, first it is getting the rank of each row in the dataset and then calculating the percentage rank using below formula:

Percent_Rank = (rank decreased by 1)/(remaining rows in the group)

For example, if we take the salary 15000 from Account and 20000 from the Sales department. The rank of these two rows are 3 and 5 respectively and Total remaining rows in each department is 4, then percent rank would be calculated as:

(3-1)/4 = 0.5 and (5-1)/4 = 1.

Note: We are decreasing by 1 as the first row in each group will start from value 0.

NTILE

It divides the number of rows in a partition into a specific number of ranked groups (bucket) as equally as possible. It returns a bucket member associated with it.

 SELECT department, salary, NTILE(4) OVER (PARTITION BY department ORDER BY salary DESC) as ntile FROM emp_dept_tbl;
DEPTSALARYNTILE
Account280001
Account200001
Account150002
Account150003
Account120004
Sales300001
Sales250001
Sales250002
Sales250003
Sales200004

Here, we have divided the dataset into 4 buckets.

Remainder = (total records in each partition)/NTILES

Remainder  = 5%4 = 1

So, the extra 1 value (the remainder of 5%4) is allocated to bucket 1, which therefore have one more value than rest of the buckets(2, 3, & 4).

Wrapping Up

In this post, we have discussed the analytics functions in Hive. We have seen how can we take top records based on the criteria from the data set, distributed the dataset into the ranked group. The ROW_NUMBER, RANK, and DENSE_RANK are the most used analytics functions inthe hive.

You can also read Windowing Functions in Hive

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply