Analytics Functions in Hive

Analytics Functions in Hive

Requirement

In this post, we are going to explore analytics functions in Hive. These are the following analytics function available in the hive:

  • ROW_NUMBER
  • RANK
  • DENSE_RANK
  • CUME_DIST
  • PERCENT_RANK
  • NTILE

Component Involved

  • Hive

Sample Data

ID

FIRST_NAME

LAST_NAME

DESIGNATION

DEPARTMENT

SALARY

1001

Jervis

Roll

Director of Sales

Sales

30000

1002

Gordon

Mattster

Marketing Manager

Sales

25000

1003

Gracie

Fronllen

Assistant Manager

Sales

25000

1004

Joelly

Wellback

Account Coordinator

Account

15000

1005

Bob

Havock

Accountant II

Account

20000

1006

Carmiae

Courage

Account Coordinator

Account

15000

1007

Cellie

Trevaskiss

Assistant Manager

Sales

25000

1008

Gally

Johnson

Manager

Account

28000

1009

Richard

Grill

Account Coordinator

Account

12000

1010

Sofia

Ketty

Sales Coordinator

Sales

20000

Solution

We have a table named emp_dept_tbl which is having above sample data. Our task is to rank the records based on the department and salary wise. We will do this using three functions available in the hive – ROW NUMBER, RANK and DENSE RANK. In addition to this, we will also check others analytics functions like CUME_DIST, NTILE, and PERCENT_RANK.

Step 1: Data in Hive Table

Let’s check the table and data which we will use in the next steps.

Step 2: Ranking Records Using Analytics Function

In this step, we are going to rank each record based on the department with the salary of an employee in each department in descending order.

ROW NUMBER

This is an analytics function which represents each row with a unique value(1,2,3,….) based on the column value used in OVER clause. Here, we are having 10 rows and will use ROW_NUMBER on these records.

 
 
SELECT department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) 
as row_num FROM emp_dept_tbl;
DEPT SALARY ROW_NUM
Account 28000 1
Account 20000 2
Account 15000 3
Account 15000 4
Account 12000 5
Sales 30000 1
Sales 25000 2
Sales 25000 3
Sales 25000 4
Sales 20000 5

Here, each row in each department has been assigned a unique value.

RANK

The RANK analytics function is used to assign a rank to the rows based on the column values in OVER clause. The row with equal values assigned the same rank with next rank value skipped.

 
 
SELECT department, salary, RANK() OVER(PARTITION BY department ORDER BY salary DESC) as rnk FROM emp_dept_tbl;
DEPARTMENT SALARY RNK
Account 28000 1
Account 20000 2
Account 15000 3
Account 15000 3
Account 12000 5
Sales 30000 1
Sales 25000 2
Sales 25000 2
Sales 25000 2
Sales 20000 5

Here, if you see all rows with salary 15000 in department account and 25000 in department sales having the same rank as 3 and 2 respectively and next row with next salary have rank 5 with next rank value skipped.

DENSE_RANK

The DENSE_RANK analytics function in hive used to assign a rank to each row. The rows with equal values receive the same rank and this rank assigned in the sequential order so that no rank values are skipped.

 
 
SELECT department, salary, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk FROM emp_dept_tbl;
DEPT SALARY DNS_RNK
Account 28000 1
Account 20000 2
Account 15000 3
Account 15000 3
Account 12000 4
Sales 30000 1
Sales 25000 2
Sales 25000 2
Sales 25000 2
Sales 20000 3

Here, the rank value has not been skipped like in RANK function.

Execute all three in a single query:

 
 
SELECT department as dept, salary as sal, 
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk,
DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as dns_rnk 
FROM emp_dept_tbl;
DEPT SALARY ROW_NUM RNK DNS_RNK
Account 28000 1 1 1
Account 20000 2 2 2
Account 15000 3 3 3
Account 15000 4 3 3
Account 12000 5 5 4
Sales 30000 1 1 1
Sales 25000 2 2 2
Sales 25000 3 2 2
Sales 25000 4 2 2
Sales 20000 5 5 3

CUME_DIST

This function stands for cumulative distribution. It computes the relative position of a column value in a group. Here, we can calculate the cumulative distribution of salaries among all departments. For a row, the cumulative distribution of salary is calculated as:

Cum_dist(salary) = Number of rows with the value lower than or equals to salary / total number of rows in the dataset

 
 
SELECT department, salary, CUME_DIST() OVER (ORDER BY salary) as cum_dist FROM emp_dept_tbl;
DEPT SALARY CUME_DIST
Account 12000 0.1
Account 15000 0.3
Account 15000 0.3
Account 20000 0.5
Sales 20000 0.5
Sales 25000 0.8
Sales 25000 0.8
Sales 25000 0.8
Account 28000 0.9
Sales 30000 1.0

Here, if we take salary 15000 and 20000, the cumulative distributions are 0.3 and 0.5 respectively that means 30% employee’s salaries are less than or equal to 15000 and 50% employee’s salaries are less than or equal to 20000.

PERCENT_RANK

It is very similar to the CUME_DIST function. It ranks the row as a percentage. The first row in any dataset has percent_rank 0 and the return value is of the double type.

Let’s rank the salary department wise as percentage:

 
 
SELECT department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as perc_rnk FROM emp_dept_tbl;
DEPT SALARY RNK PERC_RANK
Account 28000 1 0
Account 20000 2 0.25
Account 15000 3 0.5
Account 15000 3 0.5
Account 12000 5 1
Sales 30000 1 0
Sales 25000 2 0.25
Sales 25000 2 0.25
Sales 25000 2 0.25
Sales 20000 5 1

Here, first it is getting the rank of each row in the dataset and then calculating the percentage rank using below formula:

Percent_Rank = (rank decreased by 1)/(remaining rows in the group)

For example, if we take salary 15000 from Account and 20000 from the Sales department. The rank of these two rows are 3 and 5 respectively and Total remaining rows in each department is 4, then percent rank would be calculated as:

(3-1)/4 = 0.5 and (5-1)/4 = 1.

Note: We are decreasing by 1 as the first row in each group will start from value 0.

NTILE

It divides the number of rows in a partition into a specific number of ranked groups (bucket) as equally as possible. It returns a bucket member associated with it.

 
 
SELECT department, salary, NTILE(4) OVER (PARTITION BY department ORDER BY salary DESC) as ntile FROM emp_dept_tbl;
DEPT SALARY NTILE
Account 28000 1
Account 20000 1
Account 15000 2
Account 15000 3
Account 12000 4
Sales 30000 1
Sales 25000 1
Sales 25000 2
Sales 25000 3
Sales 20000 4

Here, we have divided the dataset into 4 buckets.

Remainder = (total records in each partition)/NTILES

Remainder  = 5%4 = 1

So, the extra 1 value (the remainder of 5%4) is allocated to bucket 1, which therefore have one more value than rest of the buckets(2, 3, & 4).

Wrapping Up

In this post, we have discussed the analytics function available in Hive. We have seen how can we take top records based on the criteria from the data set, distributed the dataset into the ranked group. The ROW_NUMBER, RANK, and DENSE_RANK are the most used analytics functions in the hive.

0
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.