Requirement
You have a table in Hive, and you need to process its data using Pig. To load data directly from a file we generally use PigStorage(), but to load data from a Hive table we need a different loading function. Let's go through it in detail, step by step.
Solution
Step 1: Load Data
Assume that the table doesn't exist in Hive yet, so let's create it first. Start the Hive CLI, then create and load data into the table "profits", which lives under the bdp schema.
After executing the queries below, verify that the data has been loaded successfully.
Use the command below to create the table:
CREATE SCHEMA IF NOT EXISTS bdp;
CREATE TABLE bdp.profits (product_id INT, profit BIGINT);
Use the command below to insert data into the profits table:
INSERT INTO TABLE bdp.profits VALUES
(123, 1365), (124, 3253), (125, 91522),
(123, 51842), (127, 19616), (128, 2433),
(127, 182652), (130, 21632), (122, 21632),
(127, 21632), (135, 21632), (123, 21632), (135, 3282);
Verify that the data has been loaded using the command below:
SELECT * FROM bdp.profits;
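With the inserts above, the query should return all 13 rows. The Hive CLI prints the columns tab-separated, along these lines:

123	1365
124	3253
125	91522

(the remaining ten rows follow in the same format)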
Step 2: Import the table into Pig
As we need to process this dataset using Pig, let's go to the Grunt shell. Use the command below to enter it; remember that the -useHCatalog flag is a must, as it adds the HCatalog jars required to fetch data from Hive to Pig's classpath.
pig -useHCatalog
Let's create a relation PROFITS into which we load data from the Hive table.
PROFITS = LOAD 'bdp.profits' USING org.apache.hive.hcatalog.pig.HCatLoader();
Step 3: Output
Enter the command below to see whether the data has been loaded:
DUMP PROFITS;
DUMP PROFITS will give a result like the one below.
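Pig prints every record as a tuple, so with the data inserted above the output should contain these 13 tuples (the order in which they appear may vary):

(123,1365)
(124,3253)
(125,91522)
(123,51842)
(127,19616)
(128,2433)
(127,182652)
(130,21632)
(122,21632)
(127,21632)
(135,21632)
(123,21632)
(135,3282)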
Remember that, unlike with PigStorage(), we don't need to define a schema after HCatLoader(), because it fetches the schema directly from the Hive metastore.
To confirm the schema, you can use the command below:
DESCRIBE PROFITS;
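Given the table definition above, the reported schema should look like this (HCatLoader maps Hive's INT to Pig's int and BIGINT to Pig's long):

PROFITS: {product_id: int,profit: long}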
Wrapping Up
While working on a project, we get data from many sources, and Hive can be one of them. When we are dealing with a mix of structured and unstructured data, Apache Pig is a good fit; in that case we can use HCatLoader to import a Hive table and process it together with other datasets, as sketched below.
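As a minimal sketch of such processing (the grouping and the aggregate here are illustrative, not part of the steps above), we could total the profit per product straight from the Grunt shell:

-- Load the Hive table via HCatalog; the schema comes from the metastore
PROFITS = LOAD 'bdp.profits' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Group the rows by product and sum the profit within each group
GROUPED = GROUP PROFITS BY product_id;
TOTALS = FOREACH GROUPED GENERATE group AS product_id, SUM(PROFITS.profit) AS total_profit;

-- Print the per-product totals
DUMP TOTALS;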
Keep solving, keep learning.