Proper Noun Extraction using NLP in Python

Natural Language Processing is a field of Artificial Intelligence that enables machines to process, interpret, and understand human language.

Python’s NLTK (the Natural Language Toolkit) has a number of robust functions that let us extract various kinds of information from a text. This article shows how to extract all the proper nouns present in a text using NLTK in Python.

Python program for Proper noun extraction using NLP

Proper nouns identify specific people, places, and things. Extracting entities such as proper nouns makes it easier to mine data. For example, in named entity extraction an algorithm takes a string of text (a sentence or paragraph) as input and identifies the relevant nouns (people, places, and organizations) present in it.

POS tagging

Part-of-speech tagging (POS tagging) is the process of labeling each word in a sentence with its appropriate part of speech.
NLTK’s POS tagger, nltk.pos_tag, takes a list of tokens as input and outputs a list of tuples of the form (word, tag), where the tag indicates the part of speech associated with that word, e.g. proper noun, verb, etc. The tags it uses (the Penn Treebank tag set) are listed below with examples:

CC    coordinating conjunction
CD    cardinal number
DT    determiner
EX    existential there (like: ‘there is’)
FW    foreign word
IN    preposition/subordinating conjunction
JJ    adjective ‘cheap’
JJR   adjective, comparative ‘cheaper’
JJS   adjective, superlative ‘cheapest’
LS    list item marker ‘1.’
MD    modal ‘could’, ‘will’
NN    noun, singular ‘table’
NNS   noun, plural ‘undergraduates’
NNP   proper noun, singular ‘Rohan’
NNPS  proper noun, plural ‘Indians’
PDT   predeterminer ‘all’ (as in ‘all the kids’)
POS   possessive ending, as in parent’s
PRP   personal pronoun ‘I’, ‘she’, ‘him’
PRP$  possessive pronoun ‘my’, ‘hers’
RB    adverb ‘occasionally’, ‘silently’
RBR   adverb, comparative ‘better’
RBS   adverb, superlative ‘best’
RP    particle ‘up’ (as in ‘give up’)
TO    ‘to’, as in go ‘to’ the mall
UH    interjection ‘goodbye’
VB    verb, base form ‘ask’
VBD   verb, past tense ‘swiped’
VBG   verb, gerund/present participle ‘focusing’
VBN   verb, past participle
VBP   verb, present tense, not 3rd person singular
VBZ   verb, present tense, 3rd person singular
WDT   wh-determiner ‘which’
WP    wh-pronoun ‘who’, ‘what’
WP$   possessive wh-pronoun ‘whose’
WRB   wh-adverb ‘where’, ‘how’, ‘however’

POS tagging example:

INPUT:

'Michael is his mentor'

OUTPUT (after removing the stopwords ‘is’ and ‘his’):

[('Michael', 'NNP'), ('mentor', 'NN')]
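Because the tagger returns an ordinary list of (word, tag) tuples, the result can be processed with plain Python. The following is a minimal sketch of grouping words by their tag. The `tagged` list is typed in by hand to mirror the output above rather than produced by running the tagger, and `group_by_tag` is an illustrative helper name, not an NLTK function.

```python
from collections import defaultdict


def group_by_tag(tagged):
    """Map each POS tag to the list of words carrying it, in input order."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


# Hand-written (word, tag) pairs mirroring the example output above.
tagged = [('Michael', 'NNP'), ('mentor', 'NN')]
print(group_by_tag(tagged))  # {'NNP': ['Michael'], 'NN': ['mentor']}
```

With real tagger output you would pass the list returned by nltk.pos_tag in place of the hand-written `tagged` list.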

Code | ProperNoun extraction

To run the Python code below, you must have NLTK and its associated data packages installed. For installation, refer to the guide “How to install NLTK”. To download all the packages, type the following in your environment (e.g. Spyder):

  • import nltk
  • nltk.download()

A GUI will pop up. Select “all” to download all packages, then click “Download” and wait until the installation is complete.

# Importing the required libraries
import nltk 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize

First, we import all the required libraries. ‘stopwords’ provides lists of words that do not add much meaning to a sentence (e.g. ‘a’, ‘but’). ‘word_tokenize’ splits a sentence into its tokens, i.e. words and punctuation marks, whereas ‘sent_tokenize’ splits a paragraph into its individual sentences.
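To make the stopword step concrete, here is a rough sketch of filtering a token list. It uses a small hand-picked set as a stand-in for stopwords.words('english'), and `remove_stopwords` with its `ignore_case` flag is a hypothetical helper, not part of NLTK. One detail worth noticing is that the entries in NLTK’s English stopword list are lowercase, so a plain membership test keeps capitalized tokens such as ‘He’.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {'he', 'is', 'a', 'of', 'the'}


def remove_stopwords(tokens, stop_words=STOP_WORDS, ignore_case=False):
    """Drop tokens found in stop_words, optionally ignoring case."""
    if ignore_case:
        return [t for t in tokens if t.lower() not in stop_words]
    return [t for t in tokens if t not in stop_words]


tokens = ['He', 'is', 'a', 'fan', 'of', 'the', 'movie', 'Wolverine', '.']

# Case-sensitive: 'He' survives because it does not match 'he'.
print(remove_stopwords(tokens))
# ['He', 'fan', 'movie', 'Wolverine', '.']

# Case-insensitive: 'He' is dropped as well.
print(remove_stopwords(tokens, ignore_case=True))
# ['fan', 'movie', 'Wolverine', '.']
```

Whether to ignore case depends on the text: lowercasing before the comparison removes more filler words, but the tokens handed to the tagger should keep their original capitalization.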

# Function to extract the proper nouns

def ProperNounExtractor(text):

    print('PROPER NOUNS EXTRACTED :')

    # Build the stopword set once instead of once per word
    stop_words = set(stopwords.words('english'))

    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        words = [word for word in words if word not in stop_words]
        tagged = nltk.pos_tag(words)
        for (word, tag) in tagged:
            if tag == 'NNP': # If the word is a singular proper noun
                print(word)

In the above function, we first split the paragraph into a list of sentences. Then, for every sentence in the list ‘sentences’, we split the sentence into a list of words. We remove all the stopwords from the list ‘words’ and then apply POS tagging with nltk.pos_tag to label every remaining word with its part of speech, i.e. its tag. Finally, we print each word whose tag is ‘NNP’.

text = "Rohan is a wonderful player. He was born in India. He is a fan of the movie Wolverine. He has a dog named Bruno."

# Calling the ProperNounExtractor function to extract all the proper nouns from the given text. 
ProperNounExtractor(text)

OUTPUT:

PROPER NOUNS EXTRACTED :
Rohan
India
Wolverine
Bruno
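The function checks one token at a time, so a multi-word name would be printed as separate words. One possible way to handle this, shown here only as a sketch, is to join runs of adjacent proper-noun tokens. The tagged list below is written by hand for illustration (a real tagger may label the tokens differently), and `merge_proper_nouns` is a hypothetical helper name.

```python
from itertools import groupby

PROPER_TAGS = {'NNP', 'NNPS'}


def merge_proper_nouns(tagged):
    """Join runs of adjacent proper-noun tokens into single names."""
    names = []
    # groupby splits the list into consecutive runs that share the same
    # answer to "is this token tagged as a proper noun?".
    for is_proper, run in groupby(tagged, key=lambda pair: pair[1] in PROPER_TAGS):
        if is_proper:
            names.append(' '.join(word for word, _ in run))
    return names


# Hand-written tags for illustration only.
tagged = [('Rohan', 'NNP'), ('Sharma', 'NNP'), ('visited', 'VBD'),
          ('New', 'NNP'), ('Delhi', 'NNP'), ('.', '.')]
print(merge_proper_nouns(tagged))  # ['Rohan Sharma', 'New Delhi']
```

This simple grouping will also glue together two distinct names that happen to sit next to each other, so treat it as a starting point rather than a full named entity recognizer.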

You can also extract any other part of speech from a text simply by replacing ‘NNP’ in tag == 'NNP' with the desired tag, for example ‘NNPS’ for plural proper nouns.
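Instead of editing the comparison each time, the wanted tags can be passed in as a parameter. The sketch below assumes the text has already been tagged. The `tagged` list is typed in by hand, and `extract_by_tags` and its `wanted` parameter are illustrative names rather than anything provided by NLTK. It also returns a list instead of printing, which makes the result easier to reuse.

```python
def extract_by_tags(tagged, wanted=('NNP', 'NNPS')):
    """Return the words whose tag is in `wanted`, preserving order."""
    wanted = set(wanted)
    return [word for word, tag in tagged if tag in wanted]


# Hand-written (word, tag) pairs for illustration.
tagged = [('Michael', 'NNP'), ('is', 'VBZ'), ('his', 'PRP$'), ('mentor', 'NN')]

print(extract_by_tags(tagged))                        # ['Michael']
print(extract_by_tags(tagged, wanted=('NN', 'NNS')))  # ['mentor']
print(extract_by_tags(tagged, wanted=('VBZ',)))       # ['is']
```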
