Requirement
Suppose we are having some data in a hive table. The table contains information about company’s quarterly wise profit. Now, the requirement is to find max profit of each company from all quarters.
Sample Data
The record having 5 columns – company name, quarter 1 as Q1, quarter 2 as Q2, quarter 3 as Q3 and quater4 as Q4. The data looks like
NAME | Q1 | Q2 | Q3 | Q4 |
C1 | 10.4 | 12.3 | 9.6 | 11.4 |
C2 | 9.5 | 8.7 | 10.6 | 11.3 |
C3 | 9.2 | 11.7 | 12.2 | 10.8 |
C4 | 12.3 | 10.6 | 11.3 | 9.8 |
Solution
Let’s first create a table and insert the sample data. After that, we will find the solution to achieve the requirement.
Step 1: Create Hive Table
Create a hive table say Company_Finance_Detail. The DDL will look like:
CREATE TABLE Company_Finance_Detail ( name STRING, Q1 DOUBLE, Q2 DOUBLE, Q3 DOUBLE, Q4 DOUBLE ) STORED AS ORC;
Here, we have created hive table with ORC storage.
Step 2: Load data
Use below command to insert data into created hive table.
INSERT INTO Company_Finance_Detail VALUES ('C1', 10.4, 12.3, 9.6, 11.4), ('C2', 9.5, 8.7, 10.6, 11.3), ('C3', 9.2, 11.7, 12.2, 10.8), ('C4', 12.3, 10.6, 11.3, 9.8);
select * from Company_Finance_Detail;
Step 3: Find MAX profit of each Company
Approach 1: Here, we cannot use the max function. Basically, the max function finds the maximum value from a column data. But we have to find the max of each row. So, here we have used below query.
SELECT name, (CASE WHEN Q1 > Q2 AND Q1 > Q3 AND Q1 > Q4 THEN Q1 WHEN Q2 > Q3 AND Q2 > Q4 THEN Q2 WHEN Q3 > Q4 AND Q3 > Q4 THEN Q3 ELSE Q4 END) AS Quarter FROM Company_Finance_Detail;
In the above query, we are using case condition and checking every quarter’s value of each company. We are getting the correct output. But what if more column presents in the table. Then we have to append additional query in case condition for the rest columns.
Approach 2: Instead of using this, we can use a hive function called greatest. This function is available from hive version 1.1 onward.
greatest(T v1, T v2, T v3) – Takes multiple values and return max among them.
Now, let’s use the greatest function:
SELECT name, greatest(Q1, Q2, Q3, Q4) FROM Company_Finance_Detail;
Here, we can pass multiple columns of the same data type to find out max value among them.
Note: The greatest function works fine if your table is in ORC format.
What If Hive Table is not in ORC format
Suppose, we have created a hive table Company_Finance_Detail_TEMP with default storage. So, the DDL will look like:
CREATE TABLE Company_Finance_Detail_TEMP ( name STRING, Q1 DOUBLE, Q2 DOUBLE, Q3 DOUBLE, Q4 DOUBLE );
INSERT INTO Company_Finance_Detail_TEMP VALUES ('C1', 10.4, 12.3, 9.6, 11.4), ('C2', 9.5, 8.7, 10.6, 11.3), ('C3', 9.2, 11.7, 12.2, 10.8), ('C4', 12.3, 10.6, 11.3, 9.8);
Now, let’s use the same query which has been used above to find max profit of each company.
SELECT name, greatest(Q1, Q2, Q3, Q4) FROM Company_Finance_Detail_TEMP;
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating greatest(q1,q2,q3,q4)
In order to fix this exception, we have to explicitly cast all the columns in their data type. The data type of the quarter column is double. Let’s try the updated query:
SELECT name, greatest(DOUBLE(Q1), DOUBLE(Q2), DOUBLE(Q3), DOUBLE(Q4)) FROM Company_Finance_Detail_TEMP;
Greatest function work if all the columns are of same data type and only primitive data type.
Wrapping Up
In this post, we have achieved how to find max value of a row in hive table. We have used both cases and hive build-in greatest function. Also, we have seen how greatest function behaving with different storage.