Import data from RDBMS to Hadoop

Import data from RDBMS to Hadoop

Requirement

In this post, we are going to import data from RDBMS to Hadoop. Here, we have MySQL as an RDBMS database. We will use Sqoop to import data from RDBMS to Hadoop.

Components Involved

  • MySQL – For source data
  • HDFS – To store source data in Hadoop
  • Sqoop – Tool to import data from MySQL to HDFS

Sample Data

Below are some sample records for this requirement:

sampledata
 
  1. id,first_name,last_name,gender,designation,city,country
  2. 1,Thayne,Mocher,Male,Administrative Officer,Port Colborne,Canada
  3. 2,Shelly,Bulfoot,Female,Analyst Programmer,Bacong,Philippines
  4. 3,Hercule,Chorlton,Male,Operator,Al Mazzūnah,Tunisia
  5. 4,Thorstein,Epton,Male,Administrative Officer,Tayirove,Ukraine
  6. 5,Madelena,Savin,Female,Help Desk Technician,Jinjiang,China
  7. 6,Adeline,Imesson,Female,Legal Assistant,Fort Beaufort,South Africa
  8. 7,Ambros,Richards,Male,Chemical Engineer,Dubiecko,Poland
  9. 8,Lilas,Harrowing,Female,Assistant Media Planner,Guayatá,Colombia
  10. 9,Freida,Leivers,Female,Legal Assistant,Bangus Kulon,Indonesia
  11. 10,Celie,Dolligon,Female,Data Coordiator,Paraty,Brazil
  12. 11,Berkley,Orteaux,Male,Assistant Professor,Zinder,Niger
  13. 12,Gilburt,Minot,Male,Marketing Assistant,Hanyuan,China
  14. 13,Blaine,Treverton,Male,Research Associate,Yuankeng,China
  15. 14,Benjamen,Dodd,Male,Assistant Professor,Beberon,Philippines
  16. 15,Nikos,Worpole,Male,Human Resources Assistant II,Esmeralda,Cuba

Solution

If you have MySQL table with data, then jump to step 3 else follow step 1 & 2.

Step 1: Create Table in MySQL

The first step is to have a table in MySQL. So, in this step, we will create a table named employee under the database db_bdp.

>mysql

Use below query to create a table:

 
 
USE DATABASE db_bdp;
CREATE TABLE employee (
id INT, 
first_name VARCHAR(100),
last_name VARCHAR(100),
gender VARCHAR(10),
designation VARCHAR(20),
city VARCHAR(20),
country VARCHAR(20));

Step 2: Load data into Table

Below are the sample data:

 
 
insert into employee (id, first_name, last_name, gender, designation, city, country) values (1, 'Jervis', 'Roll', 'Male', 'Director of Sales', 'Thi Tran Lac', 'Vietnam');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (2, 'Gordon', 'Maltster', 'Male', 'Marketing Manager', 'Mabu', 'China');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (3, 'Griff', 'Godsafe', 'Male', 'Actuary', 'Kipit', 'Philippines');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (4, 'Gracie', 'Franken', 'Female', 'Assistant Manager', 'Xiabuji', 'China');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (5, 'Joelly', 'Wellbank', 'Female', 'Account Coordinator', 'Whitehorse', 'Canada');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (6, 'Bab', 'Havock', 'Female', 'Accountant II', 'Basyūn', 'Egypt');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (7, 'Carmine', 'Courage', 'Female', 'Account Coordinator', 'Boyeros', 'Cuba');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (8, 'Estella', 'Marvell', 'Female', 'Structural Analysis Engineer', 'Stettler', 'Canada');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (9, 'Celie', 'Trevaskiss', 'Female', 'Assistant Manager', 'Criuleni', 'Moldova');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (10, 'Madison', 'Ranyell', 'Male', 'Research Associate', 'Angatel', 'Philippines');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (11, 'Haydon', 'Faughey', 'Male', 'Safety Technician IV', 'Masalovka', 'Russia');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (12, 'Michele', 'Zarfai', 'Male', 'Legal Assistant', 'Karatau', 'Kazakhstan');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (13, 'Ruthi', 'Bowmer', 'Female', 'Analog Circuit Design manager', 'Peski', 'Russia');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (14, 'Adolphus', 'Pickthorne', 'Male', 'Senior Developer', 'Mae Fa Luang', 'Thailand');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (15, 'Kat', 'Dymocke', 'Female', 'Geological Engineer', 'Markópoulo Oropoú', 'Greece');

Download full insert query and execute it. It will insert data into the table.

employee

Step 3: Data in RDBMS

We have created a table and loaded data into it. Now, let’s see how many records we have in this table.

SELECT * FROM employee LIMIT 10;

Step 4: Create Sqoop Job to ingest data

The template of a sqoop job looks like:

 
 
  1. sqoop import \
  2.       --connect jdbc:mysql://<host>:<port>/<db-name> \
  3.       --driver com.mysql.jdbc.Driver \
  4.       --username <username> \
  5.       --password <password> \
  6.       --table <table_name>

For our job requirement, we have created a sqoop job:

 
 
  1. sqoop import \
  2.       --connect jdbc:mysql://localhost:3306/db_bdp \
  3.       --driver com.mysql.jdbc.Driver \
  4.       --username root \
  5.       --table employee \
  6.       --split-by id \
  7.       --target-dir /user/bdp/sqoop/import

import –  used when ingesting data from RDBMS to Hadoop

connect – Used to connect with the specified connection string

driver – Used to connect to mysql

username – To get access the MySQL table

split-by – It has been given to perform a sequence. We do keep the primary key of the table in split-by. In case, you don’t want to mention any column for a split-by then mention -m 1 instead of –split-by.

target-dir  – Destination location where data will get dump

\ – Given to mention all the rows in a single line command

Step 5: Job Execution

Once we have done with the Sqoop job. Let’s execute it. It will start a MapReduce job

Step 6: Output

When you check the output directory, you will see 2  type of files SUCCESS and part-m-, where SUCCESS represents the completion of the job and part files keeps data.

 
 
hadoop fs -ls /user/bdp/sqoop/import

We can check the number of records using the command:

 
 
hadoop fs -cat /user/bdp/sqoop/import/* | wc -l

Wrapping Up

In this post, we have seen how we can import data from MySQL to HDFS using Sqoop. We have used a couple of arguments in Sqoop Job which includes MySQL configuration, source table, destination path etc. Once data is available in Hadoop ecosystem, you can process this data for analysis.

74
0

Join in hive with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in pyspark with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Join in spark using scala with example

Requirement You have two table named as A and B. and you want to perform all types of join in ...
Read More

Java UDF to convert String to date in PIG

About Code Many times it happens like you have received data from many systems and each system operates on a ...
Read More
/ java udf, Pig, pig, pig udf, string to date, udf

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.