There is an uncertain number of columns present in the hive table. Sometimes a table can have many numbers of columns and sometimes it can have few numbers of columns. If we want the value of all the columns from the table, then there is no any challenge as we can use ‘*’ from the table. But in some cases, we can have 100 of columns and we do not need few columns value among them. In such situation, we used to mentioned all the columns name in the select query manually. It is very hard to do because of the high number of columns. So, here our requirement is to exclude column(s) from select query in hive.
This is the sample data which we will use for the requirement. This table has six columns Id, First name, Last name, Sports, City, and Country.
There can be more columns present in the table. You can download the sample data from here.
We can exclude the columns using regex expression. For using the regex expression, the properties hive.support.quoted.identifiers is set to none.
Step 1: Create Table and Load Data
In this step, we are creating a hive table for loading the sample data.
Let’s say we are having a hive table named emp_sports which stores employees details (sample data). We can create the table using below DDL and load the data into it:
- CREATE TABLE db_bdpbase.emp_sports(
- id INT,
- firstname STRING,
- lastname STRING,
- sports STRING,
- city STRING,
- country STRING
- ) ROW FORMAT DELIMITED
- FIELDS TERMINATED BY ','
- STORED AS TEXTFILE
- LOAD DATA INPATH '/usr/bdp/hive/sample_data.csv' INTO db_bdpbase.emp_sports;
Here, we have created a table and loaded the sample data.
Let’s check the data in this table:
Step 2: Exclude Column(s)
In this step, we will exclude the column(s) from the select statement. For this, first we have to set the below properties in the hive:
- SET hive.support.quoted.identifiers=NONE;
Let’s say, we don’t want sports columns value. We will use below query to exclude this column.
- SET hive.cli.print.header=TRUE;SELECT `(sports)?+.+` FROM db_bdpbase.emp_sports;
Here, set hive.cli.print.header=true property is used to show the header of the table.
Now, we want to exclude ID also. The select query will look like:
- SELECT `(id|sports)?+.+` FROM db_bdpbase.emp_sports LIMIT 10;
Here, we have kept all excluding columns enclosed by the backquote (`). We can include multiple columns with the pipe(‘|’) separated.
In this post, we have seen how we can exclude a column or multiple columns from the select statement in the hive. When there are many columns present in the table, then this is the best way to exclude the columns which we don’t want rather than mentioning all the required columns of the table.