Proper Noun Extraction in Python using NLP
Natural Language Processing is a field of Artificial Intelligence that enables machines to process, interpret, and understand human language.
Python's NLTK, i.e. the Natural Language Toolkit, has a number of robust functions that allow us to extract various kinds of information from a text. This article will help you understand how to extract all the proper nouns present in a text using NLP in Python.
Python program for Proper noun extraction using NLP
Proper nouns identify specific people, places, and things. Extracting entities such as proper nouns makes it easier to mine data. For example, we can perform named entity extraction, where an algorithm takes a string of text (a sentence or paragraph) as input and identifies the relevant nouns (people, places, and organizations) present in it.
Part of Speech tagging (i.e. POS tagging) is the process of labeling each word in a sentence with its appropriate part of speech.
The POS tagger in NLTK (nltk.pos_tag) takes a list of words as input and outputs a list of tuples of the form (word, tag), where the tag indicates the part of speech associated with that word, e.g. proper noun, verb, etc. The tags, with examples, are listed below:
- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential there (like: 'there is')
- FW: foreign word
- IN: preposition/subordinating conjunction
- JJ: adjective, e.g. 'cheap'
- JJR: adjective, comparative, e.g. 'cheaper'
- JJS: adjective, superlative, e.g. 'cheapest'
- LS: list item marker, e.g. '1.'
- MD: modal, e.g. could, will
- NN: noun, singular, e.g. 'table'
- NNS: noun, plural, e.g. 'undergraduates'
- NNP: proper noun, singular, e.g. 'Rohan'
- NNPS: proper noun, plural, e.g. 'Indians'
- PDT: predeterminer, e.g. 'all' in 'all the kids'
- POS: possessive ending, e.g. parent's
- PRP: personal pronoun, e.g. I, she, him
- PRP$: possessive pronoun, e.g. my, hers
- RB: adverb, e.g. occasionally, silently
- RBR: adverb, comparative, e.g. better
- RBS: adverb, superlative, e.g. best
- RP: particle, e.g. give 'up'
- TO: to, e.g. go 'to' the mall
- UH: interjection, e.g. Goodbye
- VB: verb, base form, e.g. ask
- VBD: verb, past tense, e.g. swiped
- VBG: verb, gerund/present participle, e.g. focussing
- VBN: verb, past participle
- VBP: verb, present tense, not 3rd person singular
- VBZ: verb, present tense, 3rd person singular
- WDT: wh-determiner, e.g. which
- WP: wh-pronoun, e.g. who, what
- WP$: possessive wh-pronoun, e.g. whose
- WRB: wh-adverb, e.g. where, how, however
POS tagging example (the stopwords 'is' and 'his' are removed before tagging, so they do not appear in the output):
'Michael is his mentor'
[('Michael', 'NNP'), ('mentor', 'NN')]
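Once a sentence has been tagged, picking out the proper nouns comes down to filtering the (word, tag) tuples. The sketch below is only an illustration: the tagged list is typed in by hand from the example above rather than produced by running a tagger, and the helper name proper_nouns is hypothetical. It keeps both NNP and NNPS so that plural proper nouns are not missed.

```python
# Tags that mark proper nouns in the tag list above.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def proper_nouns(tagged):
    """Return the words whose tag is a proper-noun tag, in their original order."""
    return [word for word, tag in tagged if tag in PROPER_NOUN_TAGS]


# Hand-written (word, tag) pairs copied from the example above.
tagged = [("Michael", "NNP"), ("mentor", "NN")]
print(proper_nouns(tagged))  # ['Michael']
```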
Code | ProperNoun extraction
In order to run the Python code below, you must have NLTK and its associated packages installed. You can refer to the link for installation: How to install NLTK. To download all its packages, in your environment (e.g. Spyder) type:
- import nltk
- nltk.download()
A GUI will pop up. Select “all” to download all packages, then click ‘download’ and wait until the installation is complete.
# Importing the required libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
First, we import all the required libraries. ‘stopwords’ provides lists of words that do not add much meaning to a sentence (e.g. ‘a’, ‘but’). ‘word_tokenize’ splits a sentence into its tokens, i.e. words and punctuation marks, whereas ‘sent_tokenize’ splits a paragraph into its sentences.
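Stopword removal is a plain membership test, and it is worth seeing how it treats capitalised words. NLTK's English stopword list is lowercase, so a case-sensitive comparison lets a sentence-initial 'He' through. The sketch below uses a tiny hand-picked set as a stand-in for stopwords.words('english') (an assumption made so the example runs without any downloads), and remove_stopwords is a hypothetical helper name.

```python
# A tiny stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"a", "is", "he", "in", "was", "the", "of"}


def remove_stopwords(words, stop_words=STOP_WORDS, ignore_case=False):
    """Drop every word found in stop_words, optionally ignoring case."""
    if ignore_case:
        return [w for w in words if w.lower() not in stop_words]
    return [w for w in words if w not in stop_words]


words = ["He", "was", "born", "in", "India", "."]
print(remove_stopwords(words))                    # ['He', 'born', 'India', '.']
print(remove_stopwords(words, ignore_case=True))  # ['born', 'India', '.']
```

Whether to ignore case is a design choice: lowercasing removes more stopwords, but the words passed on to the tagger should keep their original capitalisation, since capital letters are a strong cue for proper nouns.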
# Function to extract the proper nouns
def ProperNounExtractor(text):
    print('PROPER NOUNS EXTRACTED :')
    # Build the stopword set once instead of once per word
    stop_words = set(stopwords.words('english'))
    sentences = nltk.sent_tokenize(text)
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        words = [word for word in words if word not in stop_words]
        tagged = nltk.pos_tag(words)
        for (word, tag) in tagged:
            if tag == 'NNP':  # If the word is a proper noun
                print(word)
In the above function, we first split the paragraph into a list of sentences. Then, for every sentence in the list ‘sentences’, we split the sentence into a list of words. We remove all the stopwords from the list ‘words’ and then apply POS tagging with nltk.pos_tag to label every remaining word with its part of speech, i.e. the tag. Finally, we print every word whose tag is ‘NNP’.
text = "Rohan is a wonderful player. He was born in India. He is a fan of the movie Wolverine. He has a dog named Bruno."

# Calling the ProperNounExtractor function to extract all the proper nouns from the given text.
ProperNounExtractor(text)
PROPER NOUNS EXTRACTED :
Rohan
India
Wolverine
Bruno
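The function checks each token on its own, so a multi-word name would be reported one word at a time. One possible extension, sketched below, joins runs of adjacent proper-noun tokens into a single name. The tagged list is written by hand for illustration (a real tagger's output may differ), and merge_proper_nouns is a hypothetical helper name, not part of NLTK.

```python
from itertools import groupby


def merge_proper_nouns(tagged):
    """Join runs of adjacent NNP/NNPS tokens into single names."""
    names = []
    # groupby splits the list into consecutive runs of proper / non-proper tokens.
    for is_proper, run in groupby(tagged, key=lambda pair: pair[1] in ("NNP", "NNPS")):
        if is_proper:
            names.append(" ".join(word for word, _ in run))
    return names


# Hand-written tags for illustration only.
tagged = [("Rohan", "NNP"), ("visited", "VBD"), ("New", "NNP"), ("Delhi", "NNP"), (".", ".")]
print(merge_proper_nouns(tagged))  # ['Rohan', 'New Delhi']
```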
You can also try extracting any other part of speech from a text simply by replacing 'NNP' in tag == 'NNP' with the tag you want, for example 'NNPS' for plural proper nouns.
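Related tags in the list share a prefix (NN for nouns, VB for verbs, JJ for adjectives), so matching on a prefix is a convenient way to collect a whole family of tags at once. The following is a minimal sketch under that assumption; the tagged list is hand-written rather than tagger output, and words_with_tag_prefix is a hypothetical helper name.

```python
def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Hand-written (word, tag) pairs for illustration only.
tagged = [
    ("Rohan", "NNP"),
    ("wonderful", "JJ"),
    ("player", "NN"),
    ("born", "VBN"),
    ("India", "NNP"),
]

print(words_with_tag_prefix(tagged, "NNP"))  # ['Rohan', 'India']
print(words_with_tag_prefix(tagged, "VB"))   # ['born']
print(words_with_tag_prefix(tagged, "NN"))   # ['Rohan', 'player', 'India']
```

Note that the prefix 'NN' also matches NNP and NNPS, so it returns every noun, proper or not; use an exact comparison when only one tag is wanted.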