WordCount problem for the MapReduce framework in Java over an HDFS cluster.
Since the publication of the MapReduce research paper by Jeffrey Dean and Sanjay Ghemawat at Google, the model has seen broad, practical adoption across the Big Data and Data Science industry.
The MapReduce framework and its APIs have evolved through two major versions, MRv1 and MRv2. The MapReduce engine acts over data stored in HDFS and can process datasets on the order of petabytes.
WordCount problem for the MapReduce framework in Java
Here we are going to work through the WordCount problem, generally regarded as the introductory problem for understanding the MapReduce model. We will use the MRv1 APIs (the org.apache.hadoop.mapred package). In the end we will build and run a jar file consisting of three classes:
- MyMapper.java
- MyReducer.java
- MyClient.java
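To make the goal concrete, here is a tiny hand-worked example (the input text is made up): given an input file containing

hello world
hello hadoop

the job should emit each distinct word with its frequency, one pair per line, sorted by key (the default TextOutputFormat separates key and value with a tab):

hadoop	1
hello	2
world	1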
MyMapper.java
In this class we override the map function. For every word in a line we emit the word as the key (Text, Hadoop's writable wrapper for String) and 1 as the value (IntWritable, Hadoop's writable wrapper for int). The framework then groups these values by key, so the reducer later receives each distinct word together with an iterator over its 1s.
package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable offset, Text line,
            OutputCollector<Text, IntWritable> op, Reporter reporter)
            throws IOException {
        String data = line.toString();
        // Split the line on spaces and commas.
        String[] words = data.split("[ ,]");
        for (int i = 0; i < words.length; i++) {
            // Skip empty tokens produced by consecutive delimiters.
            if (words[i].isEmpty()) {
                continue;
            }
            // Emit (word, 1) for every occurrence of a word.
            op.collect(new Text(words[i]), new IntWritable(1));
        }
    }
}
MyReducer.java
In this class we sum up the values in the iterator to obtain the frequency of each word.
package wordcount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MyReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> op, Reporter reporter)
            throws IOException {
        int sum = 0;
        // Add up all the 1s collected for this word.
        while (values.hasNext()) {
            sum += values.next().get();
        }
        // Emit (word, total count).
        op.collect(key, new IntWritable(sum));
    }
}
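As a sketch of what happens between map and reduce (an illustration, not code from the job): if the mappers emitted (hello, 1) three times and (world, 1) once, the framework sorts and groups these pairs by key, so reduce is invoked once per distinct word:

reduce("hello", [1, 1, 1])  ->  op.collect("hello", 3)
reduce("world", [1])        ->  op.collect("world", 1)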
MyClient.java
This class acts as the driver: it wires MyMapper.class and MyReducer.class into the job so each performs its particular work, and it supplies all the other details the MapReduce framework needs, such as the key/value type classes for input and output and the input and output paths.
package wordcount;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class MyClient extends Configured implements Tool {

    @Override
    public int run(String[] arg) throws Exception {
        JobConf jc = new JobConf(MyClient.class);
        Path input_path_file = new Path(arg[0]);  // existing HDFS file path
        Path output_path_dir = new Path(arg[1]);  // non-existing HDFS directory path
        FileInputFormat.setInputPaths(jc, input_path_file);

        // Wire in the mapper and its output types.
        jc.setMapperClass(MyMapper.class);
        jc.setMapOutputKeyClass(Text.class);
        jc.setMapOutputValueClass(IntWritable.class);

        // Wire in the reducer and the job's final output types.
        jc.setReducerClass(MyReducer.class);
        jc.setOutputKeyClass(Text.class);
        jc.setOutputValueClass(IntWritable.class);

        FileOutputFormat.setOutputPath(jc, output_path_dir);
        JobClient.runJob(jc); // submit the job and wait for completion
        return 0;
    }

    public static void main(String[] arg) throws Exception {
        int exitCode = ToolRunner.run(new MyClient(), arg);
        System.exit(exitCode);
    }
}
Finally, we need to package these three classes into a jar file; MyClient.java acts as the driver class for the jar.
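One way to compile and package the classes is sketched below (assuming the three source files sit in a wordcount/ directory and the hadoop command is on the PATH; `hadoop classpath` prints the Hadoop client classpath):

javac -classpath "$(hadoop classpath)" -d classes wordcount/*.java
jar cf wordcount.jar -C classes .

Then we just need to run the following command on the console: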
hadoop jar <jar_name> <package_name.driver_class> <input_path_of_data_file> <output_directory>
For the given example, wordcount is the name of the package and MyClient is the name of the driver class.
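For instance, with a hypothetical jar name and hypothetical HDFS paths, the invocation would look like this, and the results can then be inspected with hadoop fs:

hadoop jar wordcount.jar wordcount.MyClient /user/hduser/input.txt /user/hduser/wordcount_output

hadoop fs -cat /user/hduser/wordcount_output/part-00000

With the old mapred API, part-00000 holds the (word, count) pairs written by the single reducer; with more reducers there is one part file per reducer.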