WordCount problem for the MapReduce Framework in Java over an HDFS cluster.

Since the publication of the MapReduce research paper by Sanjay Ghemawat and Jeffrey Dean at Google, its usage in the Big Data and Data Science industry has become widespread.

Since their inception, the MapReduce framework and its APIs have gone through two major versions, MRv1 and MRv2. The MapReduce engine acts on data stored in HDFS and can process data ranging up to about 10 PB in size.

WordCount problem for the MapReduce Framework in Java

Here we are going to work through the WordCount problem, which is generally regarded as the basic problem for understanding the MapReduce model. We will use the MRv1 APIs. At the end we will build and run a jar file consisting of three different classes:

  • MyMapper.java
  • MyReducer.java
  • MyClient.java

 

MyMapper.java

In this class we override the map function. Our keys are of type Text (Hadoop's writable wrapper around String) and our values of type IntWritable (Hadoop's writable wrapper around int). For every word encountered we emit the word as the key with a value of 1; the framework then groups these pairs by key, so each distinct word reaches the reducer with an iterator of such values.

package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MyMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Reusable writable for the constant value 1.
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable offset, Text line, OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Split the line on spaces and commas, then emit (word, 1) for each word.
    String[] words = line.toString().split("[ ,]+");
    for (String word : words) {
      if (!word.isEmpty()) { // skip the empty token a leading separator would produce
        output.collect(new Text(word), ONE);
      }
    }
  }

}
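For example, given the (hypothetical) input line "hello world, hello", this mapper emits the pairs (hello, 1), (world, 1) and (hello, 1). The framework then sorts and groups the pairs by key before handing them to the reducer.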

MyReducer.java

In this class we sum up the values in the iterator to obtain the frequency of each word.

package wordcount;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MyReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    // Sum the 1s emitted by the mapper for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    // Emit (word, total count).
    output.collect(key, new IntWritable(sum));
  }

}
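Continuing the hypothetical example above, the reducer receives (hello, [1, 1]) and (world, [1]) and emits (hello, 2) and (world, 1).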


MyClient.java

This class acts as the driver: it assigns MyMapper.class and MyReducer.class their particular roles in the job, and it specifies the other details the MapReduce model needs, such as the key and value types for the map and reduce output and the input and output paths.

package wordcount;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class MyClient extends Configured implements Tool {

  @Override
  public int run(String[] arg) throws Exception {
    // Build the job configuration from the one ToolRunner prepared for us.
    JobConf jc = new JobConf(getConf(), MyClient.class);
    Path inputPath = new Path(arg[0]);   // existing HDFS file path
    Path outputPath = new Path(arg[1]);  // non-existing HDFS directory path
    FileInputFormat.setInputPaths(jc, inputPath);
    FileOutputFormat.setOutputPath(jc, outputPath);
    // Wire up the mapper and reducer and declare their output types.
    jc.setMapperClass(MyMapper.class);
    jc.setMapOutputKeyClass(Text.class);
    jc.setMapOutputValueClass(IntWritable.class);
    jc.setReducerClass(MyReducer.class);
    jc.setOutputKeyClass(Text.class);
    jc.setOutputValueClass(IntWritable.class);
    // Submit the job and block until it completes.
    RunningJob job = JobClient.runJob(jc);
    return job.isSuccessful() ? 0 : 1;
  }

  public static void main(String[] arg) throws Exception {
    int exitCode = ToolRunner.run(new MyClient(), arg);
    System.exit(exitCode);
  }

}
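Extending Configured and implementing Tool lets ToolRunner parse Hadoop's generic command-line options (such as -D property=value) before run() is invoked; that is why the JobConf above is built from getConf() rather than from scratch.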

Finally we need to make a jar file consisting of these three classes; MyClient.java shall act as the driver class for the jar. A minimal build sketch follows (assuming the hadoop command is on the PATH, so hadoop classpath can supply the client libraries; wordcount.jar is just an example name):
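mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes wordcount/*.java
jar cf wordcount.jar -C classes .

Then we just need to run the following command on the console: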
hadoop jar <jar_name> <package_name.driver_class> <input_path_of_data_file> <output_directory>

For the given example, wordcount is the name of the package and MyClient is the name of the driver class.
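As a concrete illustration (the HDFS paths here are hypothetical):

hadoop jar wordcount.jar wordcount.MyClient /user/hduser/input/data.txt /user/hduser/output/wordcount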

