Chunking Rules in NLP

To extract more information from a text in Natural Language Processing, we preprocess it using techniques such as stemming or lemmatization, stopword removal, and Part_Of_Speech (POS) tagging. Another such technique is chunking, which lets us extract the important phrases present in our text. This article explains what chunking is and how to implement it in Python.

Chunking in NLP

Chunking is the process of extracting groups of words (phrases) from an unstructured text. The user specifies which kind of chunk should be extracted. Chunking can be applied only after POS_tagging, because it takes the POS_tags as input and outputs the extracted chunks. One of its main applications is extracting named entities from a text, such as person names, company names, and locations.

Chunking Rules in NLP

First, we perform tokenization where we split a sentence into its corresponding words.
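
As a small illustration of this step, here is a sketch that uses wordpunct_tokenize, a regex-based tokenizer from NLTK that needs no downloaded models. It is swapped in here only to keep the example self-contained; the implementation later in this article uses word_tokenize, which may split some text differently.

```python
from nltk.tokenize import wordpunct_tokenize

# Example sentence (the same one used in the implementation section)
sentence = "The Air India flight to Delhi was ready to board."

# Split the sentence into word and punctuation tokens
words = wordpunct_tokenize(sentence)
print(words)
# ['The', 'Air', 'India', 'flight', 'to', 'Delhi', 'was', 'ready', 'to', 'board', '.']
```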

We then apply POS_tagging to label each word with its appropriate part of speech. The list of POS_tags in NLTK with examples is shown below:

 CC    coordinating conjunction
 CD    cardinal number
 DT    determiner
 EX    existential there (as in ‘there is’)
 FW    foreign word
 IN    preposition/subordinating conjunction
 JJ    adjective ‘cheap’
 JJR   adjective, comparative ‘cheaper’
 JJS   adjective, superlative ‘cheapest’
 LS    list item marker ‘1.’
 MD    modal ‘could’, ‘will’
 NN    noun, singular ‘table’
 NNS   noun, plural ‘undergraduates’
 NNP   proper noun, singular ‘Rohan’
 NNPS  proper noun, plural ‘Indians’
 PDT   predeterminer ‘all’ (as in ‘all the kids’)
 POS   possessive ending ’s (as in ‘parent’s’)
 PRP   personal pronoun ‘I’, ‘she’, ‘him’
 PRP$  possessive pronoun ‘my’, ‘her’
 RB    adverb ‘occasionally’, ‘silently’
 RBR   adverb, comparative ‘better’
 RBS   adverb, superlative ‘best’
 RP    particle ‘up’ (as in ‘give up’)
 TO    to (as in ‘go to the mall’)
 UH    interjection ‘goodbye’
 VB    verb, base form ‘ask’
 VBD   verb, past tense ‘swiped’
 VBG   verb, gerund/present participle ‘focussing’
 VBN   verb, past participle
 VBP   verb, present tense, not 3rd person singular
 VBZ   verb, present tense, 3rd person singular
 WDT   wh-determiner ‘which’
 WP    wh-pronoun ‘who’, ‘that’
 WP$   possessive wh-pronoun ‘whose’
 WRB   wh-adverb ‘where’, ‘how’, ‘however’
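
To make the tags concrete, the sketch below works on a hand-tagged copy of this article's example sentence, written in the (word, tag) format that pos_tag returns. The tags for "to", "was" and "ready" are assumptions made for illustration; a real tagger may label some words differently.

```python
# Hand-tagged words in the (word, tag) format returned by pos_tag.
# Some of these tags are assumed for illustration.
tagged_words = [
    ("The", "DT"), ("Air", "NNP"), ("India", "NNP"), ("flight", "NN"),
    ("to", "TO"), ("Delhi", "NNP"), ("was", "VBD"), ("ready", "JJ"),
    ("to", "TO"), ("board", "NN"), (".", "."),
]

# All noun tags (NN, NNS, NNP, NNPS) start with "NN"
nouns = [word for word, tag in tagged_words if tag.startswith("NN")]

# All verb tags (VB, VBD, VBG, VBN, VBP, VBZ) start with "VB"
verbs = [word for word, tag in tagged_words if tag.startswith("VB")]

print(nouns)  # ['Air', 'India', 'flight', 'Delhi', 'board']
print(verbs)  # ['was']
```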

The chunk to be extracted is defined using regex (regular expressions) along with the POS_tags. From regex, we’ll mainly use the following:

? = 0 or 1 occurrence of the preceding expression
* = 0 or more occurrences of the preceding expression
+ = 1 or more occurrences of the preceding expression
. = any single character except a newline character
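
The operators above can be tried out with Python's built-in re module. The sketch below writes the POS_tags of a sentence as one string, one <TAG> per word, and matches patterns against it. This is only a way to see how the operators behave, not a description of how NLTK works internally, and the tag string itself is an assumed tagging of this article's example sentence.

```python
import re

# Tags of the example sentence as one string (assumed tagging)
tags = "<DT><NNP><NNP><NN><TO><NNP><VBD><JJ><TO><NN>"

# ? : an optional determiner followed by a proper noun
print(re.findall(r"(?:<DT>)?<NNP>", tags))
# ['<DT><NNP>', '<NNP>', '<NNP>']

# + : one or more proper nouns in a row
print(re.findall(r"(?:<NNP>)+", tags))
# ['<NNP><NNP>', '<NNP>']

# * : zero or more proper nouns followed by a singular noun
print(re.findall(r"(?:<NNP>)*<NN>", tags))
# ['<NNP><NNP><NN>', '<NN>']

# . : exactly one extra character after NN, so <NNP> matches but <NN> does not
print(re.findall(r"<NN.>", tags))
# ['<NNP>', '<NNP>', '<NNP>']
```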

For example, to extract the noun phrases present in a sentence, one of the chunks that can be used is r''' Chunk: {<DT>*<NNP>*<NN>*} ''' (where ‘<>’ encloses a POS_tag). Here, Chunk is the label given to every match, and the pattern inside the braces matches any number of determiners, followed by any number of proper nouns, followed by any number of singular nouns. This is the format in which you should define your chunk. Also, keep in mind that the right pattern depends on your text.
Once the chunk is defined, we extract the chunks present in our sentence using RegexpParser from NLTK, which takes the tagged_words (i.e. the words along with their POS_tags) as its input.

Implementation: Chunking in NLP using Python

Now, let us try to extract all the noun phrases from a sentence using the steps defined above. First, we’ll import the required libraries and then tokenize the sentence before applying POS_tagging to it.

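# Importing the required libraries
import nltk
from nltk import pos_tag
from nltk import word_tokenize
from nltk import RegexpParser

# One-time setup: the tokenizer and tagger models must be downloaded first,
# e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
# Resource names may differ between NLTK versions.

# Example sentence
text = "The Air India flight to Delhi was ready to board."

# Splitting the sentence into words
list_of_words = word_tokenize(text)

# Applying POS_tagging
tagged_words = pos_tag(list_of_words)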

We then define our chunk keeping in mind that our aim is to extract all the noun phrases present in our sentence.

# Chunk pattern for the noun phrases: determiners, then proper nouns, then nouns
chunk_to_be_extracted = r''' Chunk: {<DT>*<NNP>*<NN>*} '''

# Applying chunking to the tagged words
chunkParser = RegexpParser(chunk_to_be_extracted)
chunked_sentence = chunkParser.parse(tagged_words)

The ‘chunked_sentence’ variable is an NLTK tree, which can be viewed using the draw() method. Note that draw() opens the tree in a separate window, so it needs a graphical environment.

# To view the NLTK tree
chunked_sentence.draw()

OUTPUT:

[Figure: NLTK tree of the chunked sentence]
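
If no graphical window is available, the tree can also be printed as text. The sketch below is self-contained, so it parses a hand-tagged copy of the example sentence instead of calling pos_tag; the tags for "to", "was" and "ready" are assumptions made for illustration.

```python
from nltk import RegexpParser

# Hand-tagged example sentence (some tags assumed for illustration)
tagged_words = [
    ("The", "DT"), ("Air", "NNP"), ("India", "NNP"), ("flight", "NN"),
    ("to", "TO"), ("Delhi", "NNP"), ("was", "VBD"), ("ready", "JJ"),
    ("to", "TO"), ("board", "NN"), (".", "."),
]

# Same chunk pattern as in the article
parser = RegexpParser(r"Chunk: {<DT>*<NNP>*<NN>*}")
tree = parser.parse(tagged_words)

# Printing the tree gives a bracketed text form: chunks appear as
# (Chunk ...) groups, and all other words stay directly under the root
print(tree)
```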

To view the chunks obtained, we iterate through the subtrees of the NLTK tree, since the chunks are stored as subtrees. We do so using the subtrees() method and pick out the chunks by checking each subtree's label() value.

# To print the chunks extracted

print('Chunks obtained: \n')
for subtree in chunked_sentence.subtrees():
    if subtree.label() == 'Chunk':
        print(subtree)

OUTPUT:

Chunks obtained:

(Chunk The/DT Air/NNP India/NNP flight/NN)
(Chunk Delhi/NNP)
(Chunk board/NN)
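
Often the chunks are more useful as plain strings than as subtrees. The following sketch joins the words of each chunk using the leaves() method. To stay self-contained it parses a hand-tagged copy of the example sentence, in which the tags for "to", "was" and "ready" are assumptions made for illustration.

```python
from nltk import RegexpParser

# Hand-tagged example sentence (some tags assumed for illustration)
tagged_words = [
    ("The", "DT"), ("Air", "NNP"), ("India", "NNP"), ("flight", "NN"),
    ("to", "TO"), ("Delhi", "NNP"), ("was", "VBD"), ("ready", "JJ"),
    ("to", "TO"), ("board", "NN"), (".", "."),
]

tree = RegexpParser(r"Chunk: {<DT>*<NNP>*<NN>*}").parse(tagged_words)

# leaves() returns the (word, tag) pairs of a subtree; keep only the words
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "Chunk"
]

print(phrases)  # ['The Air India flight', 'Delhi', 'board']
```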

You can try extracting other phrases from your sentence by defining your own chunk, i.e. by changing the ‘chunk_to_be_extracted’ variable.
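
As one possible variation, the sketch below defines a chunk labelled NP that also allows adjectives: an optional determiner, any number of adjectives, then one or more nouns of any kind. The sentence and its tags are made up and hand-tagged for illustration, so no tagger is needed to run it.

```python
from nltk import RegexpParser

# Hypothetical hand-tagged sentence, used only for this illustration
tagged_words = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"),
]

# <NN.*> matches every tag that starts with NN (NN, NNS, NNP, NNPS)
chunk_to_be_extracted = r"NP: {<DT>?<JJ>*<NN.*>+}"
tree = RegexpParser(chunk_to_be_extracted).parse(tagged_words)

noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]

print(noun_phrases)  # ['The quick brown fox', 'the lazy dog']
```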
