Split one column into multiple columns in hive

Requirement

You have one table in hive with one column and you want to split this column into multiple columns and store the results into another hive table.

Solution

Assume the name of hive table is “transact_tbl” and it has one column named as “connections”, and values in connections column are comma separated and total two commas are present in each value.

Step 1: Create Hive Table

Create an input table transact_tbl in bdp schema using below command.

 CREATE TABLE bdp.transact_tbl(connections STRING);

 Step 2: Insert Data in Hive

Insert data into connections columns, String should be comma separated. For e.g. “PSTP, POST, FRDE” is one value of connections column.

INSERT INTO TABLE transact_tbl VALUES('PSTP,POST,FRDE');
INSERT INTO TABLE transact_tbl VALUES('LEFD,HSYT,YDTH'); 
INSERT INTO TABLE transact_tbl VALUES('AQSV,NESA,LRAJ'); 
INSERT INTO TABLE transact_tbl VALUES('KSAI,SJAA,PSOA'); 
INSERT INTO TABLE transact_tbl VALUES('SHAG,JSGS,AGAQ');
INSERT INTO TABLE transact_tbl VALUES('AQDS,KAHY,AGQG'); 
INSERT INTO TABLE transact_tbl VALUES('BAGW,AHQG,HAFY');

Step 3: Show hive data

Let’s see how data looks in a transact_tbl table using below command.

 SELECT connections FROM bdp.transact_tbl;

It should show output as below:-

Step 4: Create Output hive table

Create an output table where you want to store split values. Run below command in the hive.

 CREATE TABLE optrans_tbl(con1 STRING,con2 STRING,con3 STRING);

Step 5: Use hive function

There is built in function SPLIT in the hive which expects two arguments, the first argument is string and second argument is the pattern by which string should separate.

It will convert String into an array, and desired value can be fetched using the right index of an array.

Use below query  to  store split records in the hive:-

 INSERT INTO TABLE bdp.optrans_tbl
SELECT
CONCTNS.splitted_cnctns[0] AS con1,
CONCTNS.splitted_cnctns[1] AS con2,
CONCTNS.splitted_cnctns[2] AS con3
FROM
(SELECT
split(connections,',') AS splitted_cnctns FROM bdp.transact_tbl)CONCTNS;

In above query, you can see that splitted_cnctns is an array with three values in it, which can be extracted using the proper index as con1 con2 and con3.

Inner query is used to get the array of split values and the outer query is used to assign each value to a separate column.

Step 6:  Show output

Let’s see how data looks in an optrans_tbl table using below command.

 SELECT con1,con2,con3 FROM bdp.optrans_tbl;

It should show output as below:-

 Keep Learning 🙂

 

Load CSV file into hive AVRO table

Requirement You have comma separated(CSV) file and you want to create Avro table in hive on top of it, then ...
Read More

Load CSV file into hive PARQUET table

Requirement You have comma separated(CSV) file and you want to create Parquet table in hive on top of it, then ...
Read More

Hive Most Asked Interview Questions With Answers – Part II

What is bucketing and what is the use of it? Answer: Bucket is an optimisation technique which is used to ...
Read More
/ hive, hive interview, interview-qa

Spark Interview Questions Part-1

Suppose you have a spark dataframe which contains millions of records. You need to perform multiple actions on it. How ...
Read More

2 Comments

  1. Kumar Sourabh

    what if i want one more column other than the split String, say transact_tbl table have one more column amount(Every connections have different amount) and i want amount column as well in the optrans_tbl in order to use JOIN. What should i do for that?

Leave a Reply