Find max value of a row in Hive

Requirement

Suppose we are having some data in a hive table. The table contains information about company’s quarterly wise profit. Now, the requirement is to find max profit of each company from all quarters.

Sample Data

The record having 5 columns – company name, quarter 1 as Q1, quarter 2 as Q2, quarter 3 as Q3 and quater4 as Q4. The data looks like

NAME

Q1

Q2

Q3

Q4

C1

10.4

12.3

9.6

11.4

C2

9.5

8.7

10.6

11.3

C3

9.2

11.7

12.2

10.8

C4

12.3

10.6

11.3

9.8

Solution

Let’s first create a table and insert the sample data. After that, we will find the solution to achieve the requirement.

Step 1: Create Hive Table

Create a hive table say Company_Finance_Detail. The DDL will look like:

create-table
 
  1. CREATE TABLE Company_Finance_Detail (
  2.     name STRING,
  3.     Q1 DOUBLE,
  4.     Q2 DOUBLE,
  5.     Q3 DOUBLE,
  6.     Q4 DOUBLE
  7. ) STORED AS ORC;

Here, we have created hive table with ORC storage.

Step 2: Load data

Use below command to insert data into created hive table.

 
 
  1. INSERT INTO Company_Finance_Detail VALUES ('C1', 10.4, 12.3, 9.6, 11.4),                                       ('C2', 9.5, 8.7, 10.6, 11.3),                                            ('C3', 9.2, 11.7, 12.2, 10.8),                                            ('C4', 12.3, 10.6, 11.3, 9.8);

 
 
  1. SELECT * FROM Company_Finance_Detail;

Step 3: Find MAX profit of each Company

Approach 1: Here, we cannot use the max function. Basically, the max function finds the maximum value from a column data. But we have to find the max of each row. So, here we have used below query.

 
 
  1. SELECT name,
  2.        (CASE
  3.             WHEN Q1 > Q2
  4.                  AND Q1 > Q3
  5.                  AND Q1 > Q4 THEN Q1
  6.             WHEN Q2 > Q3
  7.                  AND Q2 > Q4 THEN Q2
  8.             WHEN Q3 > Q4
  9.                  AND Q3 > Q4 THEN Q3
  10.             ELSE Q4
  11.         END) AS Quarter
  12. FROM Company_Finance_Detail;

In the above query, we are using case condition and checking every quarter’s value of each company. We are getting the correct output. But what if more column presents in the table. Then we have to append additional query in case condition for the rest columns.

Approach 2: Instead of using this, we can use a hive function called greatest. This function is available from hive version 1.1 onward.

greatest(T v1, T v2, T v3) – Takes multiple values and return max among them.

Now, let’s use the greatest function:

 
 
  1. SELECT name, greatest(Q1, Q2, Q3, Q4) FROM Company_Finance_Detail;

Here, we can pass multiple columns of the same data type to find out max value among them.

Note: The greatest function works fine if your table is in ORC format.

What If Hive Table is not in ORC format

Suppose, we have created a hive table Company_Finance_Detail_TEMP with default storage. So, the DDL will look like:

 
 
  1. CREATE TABLE Company_Finance_Detail_TEMP (
  2. name STRING,
  3. Q1 DOUBLE,
  4. Q2 DOUBLE,
  5. Q3 DOUBLE,
  6. Q4 DOUBLE
  7. );

 
 
  1. INSERT INTO Company_Finance_Detail_TEMP VALUES ('C1', 10.4, 12.3, 9.6, 11.4),                                            ('C2', 9.5, 8.7, 10.6, 11.3),                                            ('C3', 9.2, 11.7, 12.2, 10.8),                                            ('C4', 12.3, 10.6, 11.3, 9.8);

Now, let’s use the same query which has been used above to find max profit of each company.

 
 
  1. SELECT name, greatest(Q1, Q2, Q3, Q4) FROM Company_Finance_Detail_TEMP;

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating greatest(q1,q2,q3,q4)

In order to fix this exception, we have to explicitly cast all the columns in their data type. The data type of the quarter column is double. Let’s try the updated query:

 
 
  1. SELECT name, greatest(DOUBLE(Q1), DOUBLE(Q2), DOUBLE(Q3), DOUBLE(Q4)) FROM Company_Finance_Detail_TEMP;

Greatest function work if all the columns are of same data type and only primitive data type.

Wrapping Up

In this post, we have achieved how to find max value of a row in hive table. We have used both cases and hive build-in greatest function. Also, we have seen how greatest function behaving with different storage.

Load CSV file into hive AVRO table

Requirement You have comma separated(CSV) file and you want to create Avro table in hive on top of it, then ...
Read More

Load CSV file into hive PARQUET table

Requirement You have comma separated(CSV) file and you want to create Parquet table in hive on top of it, then ...
Read More

Hive Most Asked Interview Questions With Answers – Part II

What is bucketing and what is the use of it? Answer: Bucket is an optimisation technique which is used to ...
Read More
/ hive, hive interview, interview-qa

Spark Interview Questions Part-1

Suppose you have a spark dataframe which contains millions of records. You need to perform multiple actions on it. How ...
Read More

Leave a Reply