Split one column into multiple columns in hive

Split one column into multiple columns in hive

Requirement

You have one table in hive with one column and you want to split this column into multiple columns and store the results into another hive table.

Solution

Assume the name of hive table is “transact_tbl” and it has one column named as “connections”, and values in connections column are comma separated and total two commas are present in each value.

Step 1: Create Hive Table

Create an input table transact_tbl in bdp schema using below command.

 CREATE TABLE bdp.transact_tbl(connections STRING);

 Step 2: Insert Data in Hive

Insert data into connections columns, String should be comma separated. For e.g. “PSTP, POST, FRDE” is one value of connections column.

INSERT INTO TABLE transact_tbl VALUES('PSTP,POST,FRDE');
INSERT INTO TABLE transact_tbl VALUES('LEFD,HSYT,YDTH'); 
INSERT INTO TABLE transact_tbl VALUES('AQSV,NESA,LRAJ'); 
INSERT INTO TABLE transact_tbl VALUES('KSAI,SJAA,PSOA'); 
INSERT INTO TABLE transact_tbl VALUES('SHAG,JSGS,AGAQ');
INSERT INTO TABLE transact_tbl VALUES('AQDS,KAHY,AGQG'); 
INSERT INTO TABLE transact_tbl VALUES('BAGW,AHQG,HAFY');

Step 3: Show hive data

Let’s see how data looks in a transact_tbl table using below command.

 SELECT connections FROM bdp.transact_tbl;

It should show output as below:-

Step 4: Create Output hive table

Create an output table where you want to store split values. Run below command in the hive.

 CREATE TABLE optrans_tbl(con1 STRING,con2 STRING,con3 STRING);

Step 5: Use hive function

There is built in function SPLIT in the hive which expects two arguments, the first argument is string and second argument is the pattern by which string should separate.

It will convert String into an array, and desired value can be fetched using the right index of an array.

Use below query  to  store split records in the hive:-

 INSERT INTO TABLE bdp.optrans_tbl
SELECT
CONCTNS.splitted_cnctns[0] AS con1,
CONCTNS.splitted_cnctns[1] AS con2,
CONCTNS.splitted_cnctns[2] AS con3
FROM
(SELECT
split(connections,',') AS splitted_cnctns FROM bdp.transact_tbl)CONCTNS;

In above query, you can see that splitted_cnctns is an array with three values in it, which can be extracted using the proper index as con1 con2 and con3.

Inner query is used to get the array of split values and the outer query is used to assign each value to a separate column.

Step 6:  Show output

Let’s see how data looks in an optrans_tbl table using below command.

 SELECT con1,con2,con3 FROM bdp.optrans_tbl;

It should show output as below:-

 Keep Learning 🙂

 

46
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

2 Comments

  1. Kumar Sourabh

    what if i want one more column other than the split String, say transact_tbl table have one more column amount(Every connections have different amount) and i want amount column as well in the optrans_tbl in order to use JOIN. What should i do for that?

    0

    0

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.