Gender Identifier in Python using NLTK

In this tutorial, we will build a gender classification model using NLTK in Python. Natural Language Processing (NLP) is a fascinating field of study that helps computers process human language.

The task is to choose the correct gender label for a given input name. We will use supervised learning, training on a corpus that contains the correct label for each input.

The steps we will follow to build the classifier are:

  • Analyze the pattern in the data and build a function that acts as a feature extractor.
  • Extract the features from each input and return them as a dictionary, the feature set. We then train the model on these feature sets.
  • According to the NLTK documentation, male names are more likely to end in k, o, r, s and t, whereas female names are more likely to end in a, e and i. We build on this observation and assume that the last character of a name helps classify it into the correct gender.
  • The key step in building the classifier is encoding the relevant feature, in our case the final letter of the word.
  • A feature name is a case-sensitive, human-readable description of the feature.
  • Prepare the class labels and the list of labeled examples.
  • Use the feature extractor to process the data, then split the resulting list into training and test sets.
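
The last-letter pattern mentioned in the steps above can be checked by tallying final letters per label. The following is only a minimal sketch: the `sample` list and the `last_letter_counts` helper are hypothetical and stand in for the NLTK names corpus used later in the tutorial.

```python
from collections import Counter

# Hypothetical toy sample; the tutorial itself uses the NLTK names corpus.
sample = [
    ("Mark", "male"), ("Peter", "male"), ("Marco", "male"),
    ("Anna", "female"), ("Julie", "female"), ("Maria", "female"),
]

def last_letter_counts(labeled_names):
    """Count how often each final letter occurs for each gender label."""
    counts = {}
    for name, gender in labeled_names:
        counts.setdefault(gender, Counter())[name[-1].lower()] += 1
    return counts

counts = last_letter_counts(sample)
print(counts["female"].most_common(1))  # [('a', 2)]
```

On real data, a tally like this is one way to decide which letters are worth encoding as features.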

Model Implementation

Let's start coding in Python to develop our gender identifier with NLTK. First, we define the feature extractor:

def gender_features(word):
    # Use the final letter of the name as the only feature
    return {"last_letter": word[-1]}

gender_features('peter')
{'last_letter': 'r'}

The returned dictionary, which maps feature names to their values, is called a feature set. Feature names are case-sensitive strings that give a short, human-readable description of the feature, such as the last letter of the word.

With the feature extractor in place, the next step is to prepare the class labels as a list and divide the data into training and test sets:
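
Because a feature set is just a dictionary, it can hold more than one feature. The sketch below is a hypothetical variant, not part of the original tutorial: the function name `gender_features_extended` and the extra features are assumptions chosen only to show the shape of a larger feature set.

```python
def gender_features_extended(word):
    """Hypothetical variant: return several features instead of one."""
    word = word.lower()
    return {
        "last_letter": word[-1],
        "last_two_letters": word[-2:],
        "first_letter": word[0],
        "length": len(word),
    }

print(gender_features_extended("Peter"))
# {'last_letter': 'r', 'last_two_letters': 'er', 'first_letter': 'p', 'length': 5}
```

Whether extra features actually help would have to be checked on held-out data; more features can also lead to overfitting.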

import random

import nltk
from nltk.corpus import names  # requires the corpus: nltk.download("names")

if __name__ == "__main__":

    # Build a list of (name, label) pairs from the names corpus
    label_names = ([(name, "male") for name in names.words("male.txt")] +
                   [(name, "female") for name in names.words("female.txt")])

    print(len(label_names))

    # Shuffle the names in the list
    random.shuffle(label_names)

    # Process the names through the feature extractor
    feature_sets = [(gender_features(n), gender)
                    for (n, gender) in label_names]

    # Hold out the first 500 examples for testing; train on the rest
    train_set, test_set = feature_sets[500:], feature_sets[:500]
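
Since the list is shuffled randomly, the split (and therefore the reported accuracy) changes from run to run. One way to make it repeatable is to shuffle with a seeded generator. This is a sketch under assumptions: `split_examples`, its `seed` parameter, and the placeholder data are hypothetical and not part of the original code.

```python
import random

def split_examples(examples, test_size, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the order is repeatable
    return shuffled[test_size:], shuffled[:test_size]

examples = [("name%d" % i, "label") for i in range(10)]  # placeholder data
train, test = split_examples(examples, test_size=3)
print(len(train), len(test))  # 7 3
```

Calling `split_examples` again with the same seed returns the same split, which makes results easier to compare between runs.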

We now use the training dataset to train the NaiveBayesClassifier from the NLTK library:

classifier = nltk.NaiveBayesClassifier.train(train_set)
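
To give an idea of what a Naive Bayes classifier does with such feature sets, here is a small from-scratch sketch. It is not NLTK's implementation: the function names, the add-`alpha` smoothing, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(feature_sets, alpha=1.0):
    """Collect the counts needed for P(label) and P(feature=value | label)."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values_seen = defaultdict(set)       # feature name -> all observed values
    for features, label in feature_sets:
        label_counts[label] += 1
        for fname, value in features.items():
            value_counts[(label, fname)][value] += 1
            values_seen[fname].add(value)
    return label_counts, value_counts, values_seen, alpha

def classify(model, features):
    """Return the label with the highest log posterior score."""
    label_counts, value_counts, values_seen, alpha = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for fname, value in features.items():
            seen = value_counts[(label, fname)][value]
            denom = count + alpha * len(values_seen[fname])
            score += math.log((seen + alpha) / denom)  # smoothed log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data in the same (feature set, label) format.
train_data = [
    ({"last_letter": "a"}, "female"), ({"last_letter": "a"}, "female"),
    ({"last_letter": "e"}, "female"), ({"last_letter": "k"}, "male"),
    ({"last_letter": "r"}, "male"), ({"last_letter": "o"}, "male"),
]
model = train_naive_bayes(train_data)
print(classify(model, {"last_letter": "a"}))  # female
print(classify(model, {"last_letter": "k"}))  # male
```

The smoothing keeps a letter that was never seen with a given label from producing a zero probability for that label.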

Now, let's use the trained model to classify a couple of names:

g1 = classifier.classify(gender_features('Rahul'))
g2 = classifier.classify(gender_features('Elizabeth'))

print("{} is most probably a {}.".format('Rahul',g1))
print("{} is most probably a {}.".format('Elizabeth',g2))
Rahul is most probably a male.
Elizabeth is most probably a female.

Next, we can measure the accuracy of the model on the held-out test set. Because the data is shuffled randomly, the exact figure varies slightly from run to run:

# Test the accuracy of the classifier on the test data
print("\n Accuracy of the model is :",nltk.classify.accuracy(classifier, test_set)*100,"%\n")
Accuracy of the model is : 78.6 %
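
Accuracy here is simply the fraction of test examples that receive the correct label. The sketch below computes it by hand; the `accuracy` helper, the rule-based `predict` function, and the `held_out` list are hypothetical stand-ins for the trained classifier and the real test set.

```python
def accuracy(predict, labeled_examples):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(1 for x, gold in labeled_examples if predict(x) == gold)
    return correct / len(labeled_examples)

# Hypothetical rule-based predictor standing in for a trained classifier.
def predict(name):
    return "female" if name[-1].lower() in "aei" else "male"

held_out = [("Anna", "female"), ("Mark", "male"),
            ("Ruth", "female"), ("Luca", "male")]
print(accuracy(predict, held_out))  # 0.5
```

In this toy example, "Ruth" and "Luca" are misclassified, which shows how a last-letter rule can fail on individual names.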

Finally, we can list the features that the model found most informative for telling the two classes apart:

# show_most_informative_features prints its report itself and returns None
classifier.show_most_informative_features(5)
Most Informative Features
last_letter = 'a' female : male = 33.4 : 1.0
last_letter = 'k' male : female = 31.8 : 1.0
last_letter = 'v' male : female = 18.7 : 1.0
last_letter = 'f' male : female = 16.6 : 1.0
last_letter = 'p' male : female = 12.6 : 1.0
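
Each ratio in this listing compares how likely a feature value is under one label versus the other. A rough sketch of that calculation follows. The counts are hypothetical and not taken from the names corpus, and NLTK's own figures come from its internal probability estimates, so they may be computed somewhat differently.

```python
from collections import Counter

def likelihood_ratio(value, counts_a, counts_b):
    """Return P(value | class a) / P(value | class b) from raw value counts."""
    p_a = counts_a[value] / sum(counts_a.values())
    p_b = counts_b[value] / sum(counts_b.values())
    return p_a / p_b

# Hypothetical last-letter counts, not taken from the names corpus.
female = Counter({"a": 60, "e": 30, "k": 10})
male = Counter({"a": 5, "e": 15, "k": 80})

print(round(likelihood_ratio("a", female, male), 1))  # 12.0
print(round(likelihood_ratio("k", male, female), 1))  # 8.0
```

Read this way, the first line of the listing above says that names ending in "a" are about 33 times more likely to be female than male in the training data.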
