Creating User Defined Functions (UDFs) in Pig (for the MapReduce Framework)

In this Java tutorial, we will learn how to create user-defined functions in Pig for the MapReduce framework.

Pig is mainly used for batch processing of a variety of unstructured data. It is an abstraction layer over the MapReduce framework and provides three key features: the Pig Latin scripting language, the Grunt shell (which can run statements either on MapReduce or on the local engine), and User Defined Functions (UDFs) for extending Pig.

Create user-defined functions in Pig for the MapReduce framework in Java

Here we are going to see an example of a UDF implemented for Pig. For this, we will create a jar file containing a class CharCount that extends EvalFunc<Integer> and overrides the exec(Tuple t) method, where t carries the arguments passed to the function as a tuple. Each argument can be accessed by position using the get() method, which returns an Object that we cast to the expected type. We will need to import the classes org.apache.pig.EvalFunc and org.apache.pig.data.Tuple.

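package lko;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Counts how many times a letter occurs in a string, ignoring case.
public class CharCount extends EvalFunc<Integer> {
  @Override
  public Integer exec(Tuple t) throws IOException {
    // Return null instead of throwing when an argument is missing.
    if (t == null || t.size() < 2 || t.get(0) == null || t.get(1) == null) {
      return null;
    }
    String fv = ((String) t.get(0)).toLowerCase(); // string to search in
    String mv = ((String) t.get(1)).toLowerCase(); // letter to count
    if (mv.isEmpty()) {
      return null; // no letter was given
    }
    char ch = mv.charAt(0);
    int count = 0;
    for (char c : fv.toCharArray()) {
      if (c == ch) {
        ++count;
      }
    }
    return count;
  }
}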

In this example, we pass a string and a letter as the two arguments. The function counts how often the letter occurs in the string, ignoring case, and returns that count.
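The counting logic can also be tried outside Pig before building the jar. The following is only a minimal sketch in plain Java: the class name CharCountSketch and the method countOccurrences are hypothetical names chosen for illustration and are not part of Pig or of the UDF above. It assumes that returning null for missing or empty input is the desired behaviour.

```java
// Plain-Java sketch of the counting logic; no Pig classes are needed here.
// CharCountSketch and countOccurrences are hypothetical names for illustration.
class CharCountSketch {

    // Counts how often the first character of `letter` occurs in `text`,
    // ignoring case. Returns null when either input is null or when
    // `letter` is empty (an assumption about the desired behaviour).
    static Integer countOccurrences(String text, String letter) {
        if (text == null || letter == null || letter.isEmpty()) {
            return null;
        }
        char target = Character.toLowerCase(letter.charAt(0));
        int count = 0;
        for (char c : text.toLowerCase().toCharArray()) {
            if (c == target) {
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("Banana", "a"));  // prints 3
        System.out.println(countOccurrences("Alabama", "A")); // prints 4
        System.out.println(countOccurrences("Pig", "z"));     // prints 0
    }
}
```

Running the main method prints one count per line, which makes it easy to check the case-insensitive behaviour before packaging the real UDF.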

After the jar file has been created, we register it in the Grunt shell using:

> register jar_name.jar

and then call the function by its fully qualified name, in the form package_name.class_name(arg0, arg1, ..).

To use the above code in our shell after registering its jar file, we simply write in the Grunt shell:

> A = foreach R1 generate name, lko.CharCount(name, 'a');

We can also define a short alias for the class and call the UDF through it:

> define charcount lko.CharCount();
> A = foreach R1 generate name, charcount(name, 'a');

Thus, relation A contains each name together with the number of occurrences of 'a' in it, counted case-insensitively.
