Split one column into multiple columns in hive

Requirement

Suppose, you have one table in hive with one column and you want to split this column into multiple columns and then store the results into another Hive table.

Solution

Assume the name of hive table is “transact_tbl” and it has one column named as “connections”, and values in connections column are comma separated and total two commas are present in each value.

Step 1: Create Hive Table

Create an input table transact_tbl in bdp schema using below command.

 CREATE TABLE bdp.transact_tbl(connections STRING);

 Step 2: Insert Data in Hive

Insert data into connections columns, String should be comma separated. For e.g. “PSTP, POST, FRDE” is one value of connections column.

INSERT INTO TABLE transact_tbl VALUES('PSTP,POST,FRDE'), 
('LEFD,HSYT,YDTH'), 
('AQSV,NESA,LRAJ'), 
('KSAI,SJAA,PSOA'),
('SHAG,JSGS,AGAQ'), 
('AQDS,KAHY,AGQG'),
('BAGW,AHQG,HAFY');

Step 3: Show Hive data

Let’s see how data looks in a transact_tbl table using below command:

 SELECT connections FROM bdp.transact_tbl;

It should show output as below:-

Step 4: Create Output hive table

Create an output table where you want to store split values. Run below command in the hive:

 CREATE TABLE optrans_tbl(con1 STRING,con2 STRING,con3 STRING);

Step 5: Use Hive function

There is a built-in function SPLIT in the hive which expects two arguments, the first argument is a string and the second argument is the pattern by which string should separate.

It will convert String into an array, and desired value can be fetched using the right index of an array.

Use below query to store split records in the hive table:-

INSERT INTO TABLE bdp.optrans_tbl 
SELECT CONCTNS.splitted_cnctns[0] AS con1, 
CONCTNS.splitted_cnctns[1] AS con2, 
CONCTNS.splitted_cnctns[2] AS con3 
FROM (SELECT split(connections,',') AS splitted_cnctns 
FROM bdp.transact_tbl)CONCTNS;

In the above query, you can see that splitted_cnctns is an array with three values in it, which can be extracted using the proper index as con1, con2, and con3.

Inner query is used to get the array of split values and the outer query is used to assign each value to a separate column.

Step 6:  Show output

Let’s see how data looks like in an optrans_tbl table using below command:

 SELECT con1,con2,con3 FROM bdp.optrans_tbl;

It should show output as below:-

 Keep Learning 🙂

Don’t miss the tutorial on Top Big data courses on Udemy you should Buy

Sharing is caring!

Subscribe to our newsletter
Loading

2 Comments

  1. Kumar Sourabh

    what if i want one more column other than the split String, say transact_tbl table have one more column amount(Every connections have different amount) and i want amount column as well in the optrans_tbl in order to use JOIN. What should i do for that?

Leave a Reply