Suppose you have a table in Hive and you need to process its data using Pig. To load data directly from a file we generally use PigStorage(), but to load data from a Hive table we need a different load function. Let's go through it step by step.
Step 1: Load Data
Assume that we don't have any table in Hive yet, so let's create one first. Log in to Hive and load data into the table "profits", which lives under the bdp schema. After executing the queries below, verify that the data loaded successfully.

Use the commands below to create the table.
- CREATE SCHEMA IF NOT EXISTS bdp;
- CREATE TABLE bdp.profits (product_id INT, profit BIGINT);
Use the command below to insert data into the profits table. The values shown here are sample data for illustration; substitute your own. (Note that INSERT ... VALUES requires Hive 0.14 or later.)

- INSERT INTO TABLE bdp.profits VALUES (1, 10000), (2, 20000), (3, 30000);
Verify that the data is loaded using the command below.

- SELECT * FROM bdp.profits;
Step 2: Import the table into Pig
Since we need to process this dataset using Pig, let's go to the Grunt shell. Use the command below to enter it; remember that -useHCatalog is a must, as it brings in the jars required to fetch data from Hive.

- pig -useHCatalog
Let's create a relation PROFITS into which we load the data from the Hive table.

- PROFITS = LOAD 'bdp.profits' USING org.apache.hive.hcatalog.pig.HCatLoader();
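Once the relation is loaded, it behaves like any other Pig relation and can be transformed with the usual operators. As a minimal sketch (the threshold and alias names are illustrative, not from the original article):

```pig
-- Keep only rows whose profit exceeds an arbitrary example threshold.
HIGH_PROFITS = FILTER PROFITS BY profit > 15000L;

-- Total profit per product_id.
GROUPED = GROUP PROFITS BY product_id;
TOTALS  = FOREACH GROUPED GENERATE group AS product_id, SUM(PROFITS.profit) AS total_profit;
```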
Step 3: Output
Enter the command below to see whether the data is loaded or not.

- dump PROFITS;

dump PROFITS will print the tuples stored in the table.
Remember, we don't need to define a schema after HCatLoader(), unlike with PigStorage(), because HCatLoader fetches the schema directly from the Hive metastore.

To confirm the schema, use the command below.

- DESCRIBE PROFITS;
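Given the table definition above, and since HCatLoader maps Hive's INT to Pig's int and BIGINT to Pig's long, the output should look something like:

```pig
PROFITS: {product_id: int, profit: long}
```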
Wrapping Up:

While working on a project we get data from many sources, and Hive can be one of them. When we are dealing with both unstructured and structured data, Apache Pig is a good fit; in that case we can use HCatLoader to import a Hive table and process it alongside other datasets.
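To sketch that idea, here is a hypothetical end-to-end script that joins the Hive table with a flat file loaded via PigStorage(). The file path, its columns, and the alias names are assumptions for illustration only:

```pig
-- Hypothetical CSV in HDFS: product_id,product_name
PRODUCTS = LOAD '/data/products.csv' USING PigStorage(',')
           AS (product_id:int, product_name:chararray);

-- Hive table loaded through HCatalog; schema comes from the metastore.
PROFITS = LOAD 'bdp.profits' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Combine the two sources on the shared key.
JOINED = JOIN PROFITS BY product_id, PRODUCTS BY product_id;
DUMP JOINED;
```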
Keep solving, keep learning. Subscribe to us.