Chunking Rules in NLP

To gain more information from text in Natural Language Processing, we preprocess it using various techniques such as stemming/lemmatization, stopword removal, Part-of-Speech (POS) tagging, etc. Another such technique is chunking, which allows us to extract the important phrases present in our text. This article will help you understand what chunking is and how to implement it in Python.

Chunking in NLP

Chunking is the process of extracting groups of words, or phrases, from unstructured text. The chunk to be extracted is specified by the user. Chunking can be applied only after POS tagging, as it takes POS tags as input and outputs the extracted chunks. One of the main applications of chunking is extracting named entities from a text, i.e. information such as person names, company names, locations, etc.
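As a quick illustration of that named-entity use case, NLTK also ships a pre-trained named-entity chunker, nltk.ne_chunk, which works directly on POS-tagged tokens. The snippet below is only a minimal sketch: the example sentence and the nltk.download() calls are assumptions for illustration (resource names can vary slightly between NLTK versions).

# Minimal sketch: named-entity chunking with NLTK's pre-trained chunker
import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')                        # tokenizer model
nltk.download('averaged_perceptron_tagger')   # POS tagger model
nltk.download('maxent_ne_chunker')            # named-entity chunker model
nltk.download('words')                        # word list used by the chunker

sentence = "Sundar Pichai is the CEO of Google."   # example sentence (assumption)
tagged = pos_tag(word_tokenize(sentence))          # chunking takes POS tags as input
print(nltk.ne_chunk(tagged))                       # prints an nltk.Tree with named-entity subtrees (e.g. PERSON, ORGANIZATION)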

Chunking Rules in NLP

  1. First, we perform tokenization, where we split a sentence into its individual words.
  2. We then apply POS tagging to label each word with its appropriate part of speech. The list of POS tags in NLTK, with examples, is shown below:
     CC    coordinating conjunction
     CD    cardinal digit
     DT    determiner
     EX    existential there (as in "there is")
     FW    foreign word
     IN    preposition/subordinating conjunction
     JJ    adjective 'cheap'
     JJR   adjective, comparative 'cheaper'
     JJS   adjective, superlative 'cheapest'
     LS    list item marker '1.'
     MD    modal 'could', 'will'
     NN    noun, singular 'table'
     NNS   noun, plural 'undergraduates'
     NNP   proper noun, singular 'Rohan'
     NNPS  proper noun, plural 'Indians'
     PDT   predeterminer 'all the kids'
     POS   possessive ending "parent's"
     PRP   personal pronoun 'I', 'she', 'him'
     PRP$  possessive pronoun 'my', 'hers'
     RB    adverb 'occasionally', 'silently'
     RBR   adverb, comparative 'better'
     RBS   adverb, superlative 'best'
     RP    particle 'give up'
     TO    to, as in "go 'to' the mall"
     UH    interjection 'Goodbye'
     VB    verb, base form 'ask'
     VBD   verb, past tense 'swiped'
     VBG   verb, gerund/present participle 'focusing'
     VBN   verb, past participle
     VBP   verb, present tense, not 3rd person singular 'sing'
     VBZ   verb, present tense, 3rd person singular
     WDT   wh-determiner 'which'
     WP    wh-pronoun 'who', 'that'
     WP$   possessive wh-pronoun 'whose'
     WRB   wh-adverb 'where', 'how'
  3. The chunk to be extracted is defined using regular expressions (regex) together with the POS tags (see the short sketch after this list). From regex, we'll mainly use the following:
    ? = 0 or 1 match of the preceding expression
    * = 0 or more matches of the preceding expression
    + = 1 or more matches of the preceding expression
    . = any single character except a newline character
    
  4. For example, to extract the proper nouns in a sentence along with any surrounding determiners and common nouns, one of the chunks that can be used is r''' Chunk: {<DT>*<NNP>*<NN>*} ''' (where '<>' encloses a POS tag). This is the format you should use to define your chunk. Also, keep in mind that you'll have to define your chunk depending on your text.
  5. Once the chunk is defined, we extract it from our sentence using RegexpParser from NLTK, which takes the POS-tagged words as its input.
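Before the full walkthrough below, here is a minimal sketch showing how the quantifiers from step 3 combine inside a chunk grammar. The grammar and the example sentence are illustrative assumptions, not part of the main example, and the NLTK 'punkt' and 'averaged_perceptron_tagger' resources are assumed to be installed.

# Illustrative grammar: optional determiner, any number of adjectives, one or more nouns
from nltk import word_tokenize, pos_tag, RegexpParser

sketch_grammar = r''' Chunk: {<DT>?<JJ>*<NN.*>+} '''
sketch_parser = RegexpParser(sketch_grammar)
sketch_tags = pos_tag(word_tokenize("The quick brown fox jumped over the lazy dog"))
print(sketch_parser.parse(sketch_tags))
# The extracted chunks should include noun phrases such as "The quick brown fox" and "the lazy dog"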

Implementation: Chunking in NLP using Python

Now, let us try to extract all the noun phrases from a sentence using the steps defined above. First, we import the required libraries, then tokenize the sentence and apply POS tagging to it.

# Importing the required libraries
import nltk
from nltk import pos_tag
from nltk import word_tokenize
from nltk import RegexpParser

# Downloading the tokenizer and tagger models (only needed on the first run)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Example sentence
text = "The Air India flight to Delhi was ready to board."

# Splitting the sentence into words
list_of_words = word_tokenize(text)

# Applying POS tagging
tagged_words = pos_tag(list_of_words)

We then define our chunk keeping in mind that our aim is to extract all the noun phrases present in our sentence.

# Extracting the Noun Phrases
chunk_to_be_extracted = r''' Chunk: {<DT>*<NNP>*<NN>*} '''

# Applying chunking to the text
chunkParser = RegexpParser(chunk_to_be_extracted)
chunked_sentence = chunkParser.parse(tagged_words)

The ‘chunked_sentence’ variable is an NLTK Tree, which can be viewed using its draw() method.

# To view the NLTK tree
chunked_sentence.draw() 

OUTPUT:

[NLTK tree diagram of the chunked sentence, displayed in a separate window by draw()]

To view the chunks obtained, we iterate through the subtrees of the NLTK tree, since these subtrees contain both the chunks and the non-chunks. We do so using the subtrees() and label() methods.

# To print the chunks extracted

print('Chunks obtained: \n')
for subtree in chunked_sentence.subtrees():
    if subtree.label() == 'Chunk':
        print(subtree)
        

OUTPUT:

Chunks obtained:
(Chunk The/DT Air/NNP India/NNP flight/NN)
(Chunk Delhi/NNP)
(Chunk board/NN)

You can try extracting other phrases from your sentence by defining your own chunk, i.e. by changing the ‘chunk_to_be_extracted’ variable.
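For instance, here is a minimal sketch of a different pattern; the verb-plus-adjective grammar below is purely an illustration (an assumption, not part of the original example) and reuses the tagged_words variable from above.

# Illustrative alternative chunk: a verb followed by an adjective
chunk_to_be_extracted = r''' Chunk: {<VB.*><JJ>} '''
chunkParser = RegexpParser(chunk_to_be_extracted)
chunked_sentence = chunkParser.parse(tagged_words)

for subtree in chunked_sentence.subtrees():
    if subtree.label() == 'Chunk':
        print(subtree)   # likely prints (Chunk was/VBD ready/JJ) for the sentence above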

