Requirement
In this post, we are going to import data from an RDBMS into Hadoop. Here, MySQL is our source RDBMS, and we will use Sqoop to move its data into HDFS.
Components Involved
- MySQL – For source data
- HDFS – To store source data in Hadoop
- Sqoop – Tool to import data from MySQL to HDFS
Sample Data
Below are some sample records for this requirement:
id,first_name,last_name,gender,designation,city,country
1,Thayne,Mocher,Male,Administrative Officer,Port Colborne,Canada
2,Shelly,Bulfoot,Female,Analyst Programmer,Bacong,Philippines
3,Hercule,Chorlton,Male,Operator,Al Mazzūnah,Tunisia
4,Thorstein,Epton,Male,Administrative Officer,Tayirove,Ukraine
5,Madelena,Savin,Female,Help Desk Technician,Jinjiang,China
6,Adeline,Imesson,Female,Legal Assistant,Fort Beaufort,South Africa
7,Ambros,Richards,Male,Chemical Engineer,Dubiecko,Poland
8,Lilas,Harrowing,Female,Assistant Media Planner,Guayatá,Colombia
9,Freida,Leivers,Female,Legal Assistant,Bangus Kulon,Indonesia
10,Celie,Dolligon,Female,Data Coordiator,Paraty,Brazil
11,Berkley,Orteaux,Male,Assistant Professor,Zinder,Niger
12,Gilburt,Minot,Male,Marketing Assistant,Hanyuan,China
13,Blaine,Treverton,Male,Research Associate,Yuankeng,China
14,Benjamen,Dodd,Male,Assistant Professor,Beberon,Philippines
15,Nikos,Worpole,Male,Human Resources Assistant II,Esmeralda,Cuba
Solution
If you already have a MySQL table with data, jump to Step 3; otherwise, follow Steps 1 and 2.
Step 1: Create Table in MySQL
The first step is to have a table in MySQL, so we will create a table named employee under the database db_bdp. Log in to the MySQL shell:
mysql -u root -p
Use the query below to create the database (if it does not already exist) and the table:
CREATE DATABASE IF NOT EXISTS db_bdp;
USE db_bdp;

CREATE TABLE employee (
  id INT,
  first_name VARCHAR(100),
  last_name VARCHAR(100),
  gender VARCHAR(10),
  designation VARCHAR(50),
  city VARCHAR(20),
  country VARCHAR(20)
);
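As a quick sanity check (an addition to the original steps), you can verify the table structure from the MySQL shell:

DESCRIBE employee;
-- Shows each column with its data type, confirming the table matches the definition above.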
Step 2: Load Data into the Table
Below are the sample INSERT statements:
insert into employee (id, first_name, last_name, gender, designation, city, country) values (1, 'Jervis', 'Roll', 'Male', 'Director of Sales', 'Thi Tran Lac', 'Vietnam');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (2, 'Gordon', 'Maltster', 'Male', 'Marketing Manager', 'Mabu', 'China');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (3, 'Griff', 'Godsafe', 'Male', 'Actuary', 'Kipit', 'Philippines');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (4, 'Gracie', 'Franken', 'Female', 'Assistant Manager', 'Xiabuji', 'China');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (5, 'Joelly', 'Wellbank', 'Female', 'Account Coordinator', 'Whitehorse', 'Canada');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (6, 'Bab', 'Havock', 'Female', 'Accountant II', 'Basyūn', 'Egypt');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (7, 'Carmine', 'Courage', 'Female', 'Account Coordinator', 'Boyeros', 'Cuba');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (8, 'Estella', 'Marvell', 'Female', 'Structural Analysis Engineer', 'Stettler', 'Canada');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (9, 'Celie', 'Trevaskiss', 'Female', 'Assistant Manager', 'Criuleni', 'Moldova');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (10, 'Madison', 'Ranyell', 'Male', 'Research Associate', 'Angatel', 'Philippines');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (11, 'Haydon', 'Faughey', 'Male', 'Safety Technician IV', 'Masalovka', 'Russia');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (12, 'Michele', 'Zarfai', 'Male', 'Legal Assistant', 'Karatau', 'Kazakhstan');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (13, 'Ruthi', 'Bowmer', 'Female', 'Analog Circuit Design manager', 'Peski', 'Russia');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (14, 'Adolphus', 'Pickthorne', 'Male', 'Senior Developer', 'Mae Fa Luang', 'Thailand');
insert into employee (id, first_name, last_name, gender, designation, city, country) values (15, 'Kat', 'Dymocke', 'Female', 'Geological Engineer', 'Markópoulo Oropoú', 'Greece');
Download the full insert script and execute it; it will load the data into the table.
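Assuming the downloaded script is saved as employee_inserts.sql (a hypothetical file name), it can be executed from the shell like this:

# Runs every INSERT statement in the script against the db_bdp database
mysql -u root -p db_bdp < employee_inserts.sql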
Step 3: Data in RDBMS
We have created the table and loaded data into it. Now let's verify it by looking at a few records:
SELECT * FROM employee LIMIT 10;
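To see the total record count (a small addition beyond the query above), you can run:

SELECT COUNT(*) FROM employee;
-- Returns the total number of rows loaded into the table.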
Step 4: Create Sqoop Job to ingest data
A Sqoop import job follows this template:
sqoop import \
  --connect jdbc:mysql://<host>:<port>/<db-name> \
  --driver com.mysql.jdbc.Driver \
  --username <username> \
  --password <password> \
  --table <table_name>
For our requirement, we have created the following Sqoop job:
sqoop import \
  --connect jdbc:mysql://localhost:3306/db_bdp \
  --driver com.mysql.jdbc.Driver \
  --username root \
  --table employee \
  --split-by id \
  --target-dir /user/bdp/sqoop/import
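Note that the job above does not pass a password. If your MySQL user requires one, a sketch of the same command using Sqoop's -P option (which prompts for the password at run time) would look like this:

sqoop import \
  --connect jdbc:mysql://localhost:3306/db_bdp \
  --driver com.mysql.jdbc.Driver \
  --username root \
  -P \
  --table employee \
  --split-by id \
  --target-dir /user/bdp/sqoop/import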
import – Sqoop tool used to ingest data from an RDBMS into Hadoop
connect – JDBC connection string of the source database
driver – JDBC driver class used to connect to MySQL
username – MySQL user that has access to the source table
split-by – Column used to split the work across mappers; we usually use the table's primary key here. If you don't want to specify a split column, use -m 1 instead of --split-by (see the sketch after this list)
target-dir – HDFS directory where the imported data will be written
\ – Line-continuation character that lets a single command span multiple lines
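For reference, a single-mapper variant of the same import (the -m 1 option mentioned above, shown as a sketch) would look like this:

sqoop import \
  --connect jdbc:mysql://localhost:3306/db_bdp \
  --driver com.mysql.jdbc.Driver \
  --username root \
  --table employee \
  -m 1 \
  --target-dir /user/bdp/sqoop/import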
Step 5: Job Execution
Once the Sqoop job is ready, let's execute it. It will launch a MapReduce job.
Step 6: Output
When you check the output directory, you will see two types of files: _SUCCESS and part-m-*. The _SUCCESS file marks the completion of the job, and the part files hold the imported data.
hadoop fs -ls /user/bdp/sqoop/import
We can check the number of records using the command:
hadoop fs -cat /user/bdp/sqoop/import/* | wc -l
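To peek at a few of the imported records (assuming Sqoop's default comma-delimited text output), you can pipe the same files through head:

hadoop fs -cat /user/bdp/sqoop/import/* | head -5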
Wrapping Up
In this post, we have seen how to import data from MySQL into HDFS using Sqoop. We used several Sqoop arguments covering the MySQL connection, the source table, and the destination path. Once the data is available in the Hadoop ecosystem, you can process it for further analysis.