How to find the number of records using Map Reduce

Requirement

In a real-world scenario, data files contain many records, and there may be many such files available. In that case, we need a suitable approach to compute the result. Here, we want the total number of records available in the data file(s). So the requirement is: how to find the number of records using MapReduce.

Components Involved

  • HDFS: Used to store the input data that is passed to the MapReduce job as input, and to store the MapReduce output.
  • MapReduce: Used to process the data/file(s) available in HDFS and write the output back to HDFS.

Sample Data

Let’s have a look at what the sample data looks like:

sampledata
 
id,first_name,last_name,gender,designation,city,country
1,Thayne,Mocher,Male,Administrative Officer,Port Colborne,Canada
2,Shelly,Bulfoot,Female,Analyst Programmer,Bacong,Philippines
3,Hercule,Chorlton,Male,Operator,Al Mazzunah,Tunisia
4,Thorstein,Epton,Male,Administrative Officer,Tayirove,Ukraine
5,Madelena,Savin,Female,Help Desk Technician,Jinjiang,China
6,Adeline,Imesson,Female,Legal Assistant,Fort Beaufort,South Africa
7,Celie,Richards,Male,Chemical Engineer,Dubiecko,Poland
8,Lilas,Harrowing,Female,Assistant Media Planner,Guayata,Colombia
9,Freida,Leivers,Female,Legal Assistant,Bangus Kulon,Indonesia
10,Celie,Dolligon,Female,Data Coordiator,Paraty,Brazil
11,Berkley,Orteaux,Male,Assistant Professor,Zinder,Niger
12,Gilburt,Minot,Male,Marketing Assistant,Hanyuan,China
13,Blaine,Treverton,Male,Research Associate,Yuankeng,China
14,Benjamen,Dodd,Male,Assistant Professor,Beberon,Philippines
15,Nikos,Worpole,Male,Human Resources Assistant II,Esmeralda,Cuba
16,Hercule,Richards,Male,Chemical Engineer,Dubiecko,Poland

Here, the sample data contains users' info in comma (,) separated format. You can download the sample data file from here.


Solution

The solution involves several steps: preparing the input data, writing the code, executing the job, and validating the output. Let's go through them one by one:

Step 1: Input Data Preparation

The first step is data preparation. In a real scenario, you will have your own data; data preparation simply means making the file available at a location from where our MapReduce job can pick it up as an input file.

Let's keep the sample data file at a local path. In my case, the file name is "sampledataForDuplicate" and the local path is "/home/NN/HadoopRepo/MapReduce/resources/recordCount".

We need to move this data file to an HDFS location, which will be the input path for the MapReduce job.

hdfs
 
hadoop fs -put /home/NN/HadoopRepo/MapReduce/resources/recordCount /user/bdp/mapreduce

It copies the data file from the local path to the HDFS location.
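To confirm the copy, you can list the target directory with hadoop fs -ls /user/bdp/mapreduce.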

Step 2: Create Maven Project

In order to write the MapReduce program, create a Maven project. We will add the dependencies for the required packages.

Follow the below steps to create the Maven project:

  • Open Eclipse
  • Create a Maven project

Step 3: Resolve Dependency

Add the below dependencies in the pom.xml file and resolve them from the command line:

pom.xml
 
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.1</version>
</dependency>
<!-- Hadoop Mapreduce Client Core -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.7.1</version>
</dependency>
<dependency>
    <groupId>jdk.tools</groupId>
    <artifactId>jdk.tools</artifactId>
    <version>${java.version}</version>
    <scope>system</scope>
    <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
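These <dependency> entries go inside the <dependencies> element of pom.xml. The jdk.tools entry resolves tools.jar from your local JDK via ${JAVA_HOME}, so make sure JAVA_HOME points to a JDK installation rather than a JRE.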

Step 4: Write Mapper

Once you are done with all the above steps, write a mapper class which takes the input file. It reads the file line by line, skips the header row, and emits the key "Record" with the value 1 for every data record. Here we are using a Java program to write the mapper.

Mapper
 
package com.bdp.mapreduce.recordcount.mapper;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private Text record = new Text("Record");

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Skip the header line (byte offset 0 containing the column names)
        if (key.get() == 0 && value.toString().contains("first_name")) {
            return;
        } else {
            // Every other line is a record: emit ("Record", 1)
            context.write(record, one);
        }
    }
}

Step 5: Write Reducer

In this step, we take the mapper output as input and process it. The actual counting logic is written here.

Find the code below:

Reducer
 
package com.bdp.mapreduce.recordcount.reducer;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RecordCountReducer
        extends Reducer<Text, IntWritable, NullWritable, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, NullWritable, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Sum up the 1s emitted by the mapper to get the total record count
        int recordCount = 0;
        for (IntWritable value : values) {
            recordCount += value.get();
        }
        // Write only the count; no key is needed in the output
        context.write(NullWritable.get(), new IntWritable(recordCount));
    }
}
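Note that this reducer writes a NullWritable key, so it cannot also be registered as a combiner (a combiner's output types must match the map output types). If your input is large and you want to cut down shuffle traffic, a separate combiner can be plugged in. The class below is only a sketch and is not part of the original project (the package and the name RecordCountCombiner are assumptions):

Combiner

package com.bdp.mapreduce.recordcount.combiner;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Optional combiner: pre-aggregates the ("Record", 1) pairs on the map side
// so each mapper ships a single partial count to the reducer.
public class RecordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int partialCount = 0;
        for (IntWritable value : values) {
            partialCount += value.get();
        }
        // Keep the Text key so the output types still match the map output types
        context.write(key, new IntWritable(partialCount));
    }
}

If you use it, register it in the driver with job.setCombinerClass(RecordCountCombiner.class) (see the note after the driver code).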

Step 6: Write Driver

In order to execute the mapper and reducer, let's create a driver class which calls the mapper and reducer. Find the driver class code below:

Driver
 
package com.bdp.mapreduce.recordcount.driver;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.bdp.mapreduce.recordcount.mapper.RecordCountMapper;
import com.bdp.mapreduce.recordcount.reducer.RecordCountReducer;

public class RecordCountDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        @SuppressWarnings("deprecation")
        Job job = new Job(getConf(), "Record Count");
        job.setJarByClass(getClass());

        job.setMapperClass(RecordCountMapper.class);
        job.setReducerClass(RecordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // args[0] = HDFS input path, args[1] = HDFS output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int jobStatus = ToolRunner.run(new RecordCountDriver(), args);
        System.out.println(jobStatus);
    }
}
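A couple of optional refinements to run(), shown here only as a hedged sketch (they are not in the original driver): declaring the reducer's output types explicitly and registering the combiner sketched above. Both calls go before waitForCompletion():

// Optional: declare the reducer's output types explicitly
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(IntWritable.class);

// Optional: pre-aggregate counts on the map side (assumes the RecordCountCombiner sketch above)
job.setCombinerClass(RecordCountCombiner.class);

If you add these lines, remember to also import org.apache.hadoop.io.NullWritable and the combiner class in the driver.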

Step 7: Package Preparation

In this step, we will create a package (.jar) of the project. Follow the below steps:

  • Open CMD
  • Navigate to the Maven project directory
  • Run the command: mvn package
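If the build succeeds, Maven writes the jar under the project's target directory (here that would be target/MapReduceForRecordCount-0.0.1-SNAPSHOT.jar).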

You can also download the package which I have built for this requirement:

MapReduceForRecordCount-0.0.1-SNAPSHOT

Step 8: Execution

All the setup has been done. Let's execute the job and validate the output. In order to execute the MapReduce job, use the below command:
Format: hadoop jar <path of jar> <driver class with package name> <input data path on HDFS> <output path on HDFS>

Ex:
hadoop jar /home/NN/HadoopRepo/MapReduce/MapReduceForRecordCount-0.0.1-SNAPSHOT.jar com.bdp.mapreduce.recordcount.driver.RecordCountDriver /user/bdp/mapreduce/recordCount /user/bdp/mapreduce/out/recordCount
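Here, /user/bdp/mapreduce/recordCount and /user/bdp/mapreduce/out/recordCount map to args[0] and args[1] in the driver's run() method, i.e., the HDFS input and output paths. Note that the output directory must not already exist; otherwise the job will fail.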

Step 9: Validate Output

Check the output at the HDFS output path.

output
 
[root@NN hadoop-2.6.0]# hadoop fs -ls /user/bdp/mapreduce/out/recordCount
17/08/01 22:04:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup          0 2017-08-01 22:04 /user/bdp/mapreduce/out/recordCount/_SUCCESS
-rw-r--r--   1 root supergroup          3 2017-08-01 22:04 /user/bdp/mapreduce/out/recordCount/part-r-00000
print output
 
[root@NN hadoop-2.6.0]# hadoop fs -cat /user/bdp/mapreduce/out/recordCount/part-r-00000
17/08/01 22:05:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16
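The job reports 16, which matches the 16 data records in the sample file (the header row is skipped by the mapper).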

Wrapping Up

In this post, we have written a MapReduce job to count the number of records in the data file(s), which is similar to SQL's count(*). The job contains three classes: a mapper, a reducer, and a driver. The mapper reads the input and emits one key-value pair per record, the reducer sums these pairs to get the count, and the driver sets up the configuration and runs the job.
