Chunking Rules in NLP
To extract more information from text in Natural Language Processing (NLP), we preprocess it with techniques such as stemming or lemmatization, stopword removal, and part-of-speech (POS) tagging. Chunking is another such technique: it lets us extract the important phrases present in a text. This article explains what chunking is and how to implement it in Python.
Chunking in NLP
Chunking is the process of extracting groups of words (phrases) from unstructured text. The user specifies which kind of chunk to extract. Chunking can be applied only after POS tagging, because it takes the POS tags as input and outputs the extracted chunks. One of its main applications is extracting named entities from a text, such as person names, company names, and locations.
Chunking Rules in NLP
- First, we perform tokenization where we split a sentence into its corresponding words.
- We then apply POS_tagging to label each word with its appropriate part of speech. The list of POS_tags in NLTK with examples is shown below:
  CC: coordinating conjunction
  CD: cardinal number
  DT: determiner
  EX: existential there (e.g. "there is")
  FW: foreign word
  IN: preposition/subordinating conjunction
  JJ: adjective (e.g. 'cheap')
  JJR: adjective, comparative (e.g. 'cheaper')
  JJS: adjective, superlative (e.g. 'cheapest')
  LS: list item marker (e.g. '1.')
  MD: modal (e.g. 'could', 'will')
  NN: noun, singular (e.g. 'table')
  NNS: noun, plural (e.g. 'undergraduates')
  NNP: proper noun, singular (e.g. 'Rohan')
  NNPS: proper noun, plural (e.g. 'Indians')
  PDT: predeterminer (e.g. 'all' in 'all the kids')
  POS: possessive ending (e.g. the 's in parent's)
  PRP: personal pronoun (e.g. 'I', 'she', 'him')
  PRP$: possessive pronoun (e.g. 'my', 'hers')
  RB: adverb (e.g. 'occasionally', 'silently')
  RBR: adverb, comparative (e.g. 'better')
  RBS: adverb, superlative (e.g. 'best')
  RP: particle (e.g. 'up' in 'give up')
  TO: to (e.g. 'to' in 'go to the mall')
  UH: interjection (e.g. 'Goodbye')
  VB: verb, base form (e.g. 'ask')
  VBD: verb, past tense (e.g. 'swiped')
  VBG: verb, gerund/present participle (e.g. 'focussing')
  VBN: verb, past participle
  VBP: verb, present tense, not 3rd person singular (e.g. 'sing')
  VBZ: verb, present tense, 3rd person singular
  WDT: wh-determiner (e.g. 'which')
  WP: wh-pronoun (e.g. 'who', 'what')
  WP$: possessive wh-pronoun (e.g. 'whose')
  WRB: wh-adverb (e.g. 'where', 'how', 'however')
- The chunk to be extracted is defined using regex (regular expressions) along with the POS_tags. From regex, we’ll mainly use the following:
  ? = 0 or 1 match of the preceding expression
  * = 0 or more matches of the preceding expression
  + = 1 or more matches of the preceding expression
  . = any single character except a newline character
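- For example, to extract the noun phrases present in a sentence, one chunk pattern that can be used is r''' Chunk: {<DT>*<NNP>*<NN>*} ''' (where '<>' encloses a POS tag). Here 'Chunk' is the label given to each match, and the braces enclose the tag pattern: any number of determiners, followed by any number of singular proper nouns, followed by any number of singular common nouns. Define your own chunks in this same format, and keep in mind that the right pattern depends on your text.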
- Once the chunk is defined, we extract the matching chunks from our sentence using RegexpParser from NLTK, which takes the tagged words (i.e. the (word, POS tag) pairs) as its input.
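To make the role of the regex quantifiers concrete, the sketch below applies a tag pattern to a hand-tagged sentence using only Python's re module. It illustrates the idea and is not NLTK's implementation: the helper name simple_chunk, the angle-bracket tag string, and the hand-written tags are all assumptions made for this example.

```python
import re

def simple_chunk(tagged_words, tag_regex):
    """Return the runs of (word, tag) pairs whose tag sequence matches tag_regex."""
    # Encode the tag sequence as one string, e.g. "<DT><NNP><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_words)
    chunks = []
    for match in re.finditer(tag_regex, tag_string):
        if match.start() == match.end():
            continue  # every part of the pattern is optional, so skip empty matches
        # The number of "<" before a position equals the number of words before it.
        first = tag_string.count("<", 0, match.start())
        last = tag_string.count("<", 0, match.end())
        chunks.append(tagged_words[first:last])
    return chunks

# Hand-written tags (an assumption); normally these come from a POS tagger.
tagged_words = [
    ("The", "DT"), ("Air", "NNP"), ("India", "NNP"), ("flight", "NN"),
    ("to", "TO"), ("Delhi", "NNP"), ("was", "VBD"), ("ready", "JJ"),
    ("to", "TO"), ("board", "NN"), (".", "."),
]

# Zero or more determiners, then proper nouns, then common nouns.
pattern = r"(?:<DT>)*(?:<NNP>)*(?:<NN>)*"

chunks = simple_chunk(tagged_words, pattern)
for chunk in chunks:
    print(chunk)
```

Changing a * to + in the pattern would make that tag mandatory: for instance, (?:<NN>)+ at the end would keep only the runs that finish with at least one common noun.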
Implementation: Chunking in NLP using Python
Now, let us try to extract all the noun phrases from a sentence using the steps defined above. First, we’ll import the required libraries and then tokenize the sentence before applying POS_tagging to it.
# Importing the required libraries
# (word_tokenize and pos_tag need NLTK's tokenizer and tagger data, fetched
# once with nltk.download(); the resource names depend on the NLTK version)
import nltk
from nltk import pos_tag
from nltk import word_tokenize
from nltk import RegexpParser

# Example sentence
text = "The Air India flight to Delhi was ready to board."

# Splitting the sentence into words
list_of_words = word_tokenize(text)

# Applying POS tagging
tagged_words = pos_tag(list_of_words)
We then define our chunk keeping in mind that our aim is to extract all the noun phrases present in our sentence.
# Extracting the noun phrases
chunk_to_be_extracted = r''' Chunk: {<DT>*<NNP>*<NN>*} '''

# Applying chunking to the text
chunkParser = nltk.chunk.RegexpParser(chunk_to_be_extracted)
chunked_sentence = chunkParser.parse(tagged_words)
The 'chunked_sentence' variable is an NLTK tree, which can be viewed using the draw() method. This opens the tree in a separate window.
# To view the NLTK tree
chunked_sentence.draw()
OUTPUT: [Figure: tree diagram of the chunked sentence]
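Since draw() needs a graphical display, a text view can be handier on a server or in a terminal session. The sketch below is one possible alternative and not part of the original walkthrough: it prints the tree in bracketed form and converts it to (word, tag, IOB label) triples with NLTK's tree2conlltags. The POS tags are written by hand here, an assumption made so that the snippet runs without tagger data.

```python
from nltk import RegexpParser
from nltk.chunk import tree2conlltags

# Hand-written tags (an assumption); pos_tag() would normally supply these.
tagged_words = [
    ("The", "DT"), ("Air", "NNP"), ("India", "NNP"), ("flight", "NN"),
    ("to", "TO"), ("Delhi", "NNP"), ("was", "VBD"), ("ready", "JJ"),
    ("to", "TO"), ("board", "NN"), (".", "."),
]

# Same chunk definition as in the article.
parser = RegexpParser(r"Chunk: {<DT>*<NNP>*<NN>*}")
tree = parser.parse(tagged_words)

# Bracketed text form of the tree.
print(tree)

# One (word, POS tag, IOB label) triple per token:
# B-Chunk starts a chunk, I-Chunk continues it, O is outside any chunk.
iob_triples = tree2conlltags(tree)
for triple in iob_triples:
    print(triple)
```

The IOB form is convenient when the chunks need to be stored alongside the tokens, for example in a table with one row per word.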
To view the chunks obtained, we iterate through the subtrees of the NLTK tree using the subtrees() method. It yields the whole sentence tree as well as each chunk, so we use the label() method to keep only the subtrees labelled 'Chunk'.
# To print the chunks extracted
print('Chunks obtained: \n')
for subtree in chunked_sentence.subtrees():
    if subtree.label() == 'Chunk':
        print(subtree)
OUTPUT:
Chunks obtained:
(Chunk The/DT Air/NNP India/NNP flight/NN)
(Chunk Delhi/NNP)
(Chunk board/NN)
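If plain phrases are more useful than tree fragments, the words of each chunk can be joined back into strings. The following is a minimal sketch under the same assumptions as the example sentence above: the POS tags are written by hand rather than produced by pos_tag(), and the variable names are illustrative.

```python
from nltk import RegexpParser

# Hand-written tags (an assumption); pos_tag() would normally supply these.
tagged_words = [
    ("The", "DT"), ("Air", "NNP"), ("India", "NNP"), ("flight", "NN"),
    ("to", "TO"), ("Delhi", "NNP"), ("was", "VBD"), ("ready", "JJ"),
    ("to", "TO"), ("board", "NN"), (".", "."),
]

tree = RegexpParser(r"Chunk: {<DT>*<NNP>*<NN>*}").parse(tagged_words)

# leaves() returns the (word, tag) pairs of a subtree; keep only the words.
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(phrases)
```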
You can try extracting other phrases from your sentence by defining your own chunk, i.e. by changing the 'chunk_to_be_extracted' variable.
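As one illustration of a different chunk definition, the sketch below uses a noun-phrase pattern that also has a slot for adjectives and requires at least one noun. The grammar, the 'NP' label, and the hand-tagged sentence are assumptions chosen for this example, not something prescribed by NLTK.

```python
from nltk import RegexpParser

# Hand-written tags (an assumption) for a sentence that contains adjectives.
tagged_words = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"),
    ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]

# Optional determiner, any number of adjectives, then one or more nouns.
# <NN.*> matches every tag that starts with NN (NN, NNS, NNP, NNPS).
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
tree = RegexpParser(grammar).parse(tagged_words)

noun_phrases = [
    subtree.leaves()
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
for phrase in noun_phrases:
    print(phrase)
```

Because the final element uses + rather than *, a determiner or adjective on its own never forms a chunk; only runs that end in a noun are kept.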