Introduction to NLTK: Tokenization, Stemming, Lemmatization, POS Tagging
Hello everyone! In this tutorial, we’ll be learning about the Natural Language Toolkit (NLTK), the most popular, open-source, and complete Python library for Natural Language Processing (NLP). It supports more human languages than comparable libraries. After this tutorial, we will be familiar with many NLP concepts, including Tokenization, Stemming, Lemmatization, and POS (Part-of-Speech) Tagging, and will be able to do some data preprocessing. Let us start this tutorial with the installation of the NLTK library in our environment.
Install the NLTK library in the Python environment using the following command.
pip install nltk
We are now ready to move forward, and we encourage you to write the code along with us.
Importing & Downloading packages inside NLTK
import nltk
nltk.download()
nltk.download() will open the NLTK Downloader, from which we can download the packages of our choice. To avoid import errors later, we recommend downloading all packages at once.
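If downloading everything is too heavy, the individual data packages this tutorial relies on can be fetched by name. As a sketch, the resource names below are the standard NLTK identifiers for the tokenizer models, WordNet, the stopword lists, and the default POS tagger:

```python
# NLTK data packages used later in this tutorial, with what each provides.
# In a live session, pass each name to nltk.download(), e.g.:
#   import nltk
#   nltk.download("punkt")
resources = {
    "punkt": "sentence/word tokenizer models (sent_tokenize, word_tokenize)",
    "wordnet": "WordNet database used by WordNetLemmatizer",
    "stopwords": "stopword lists for several languages",
    "averaged_perceptron_tagger": "default model behind pos_tag",
}
for name, purpose in resources.items():
    print(f"{name}: {purpose}")
```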
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.tag import pos_tag
We will be using these imports throughout this tutorial and will learn about each of them as we move ahead.
Opening and Reading the text file
with open(r'D:\VS_code_workspace\nltk_def.txt') as file:
    para = file.read()
For this tutorial, we have taken the first few lines of the NLTK definition from Wikipedia. You can work with any text file present on your system, but note that the larger the file, the longer it will take to process.
Tokenization: NLTK Python
Tokenization is the process of splitting the corpus or paragraph we have into sentences and words. This is the first step in NLP, and it is done because it is very difficult to process the whole corpus at once; there are also words that exist only for structure and add no value to the data we want. We’ll be discussing these throughout the tutorial. Do follow the steps and try to analyze the output.
sentences = nltk.sent_tokenize(para)
print(sentences)
words = nltk.word_tokenize(para)
print(words)
grams_3 = list(ngrams(words, 3))
print(grams_3)
We have used the sent_tokenize() and word_tokenize() functions to make lists of the sentences and words in our data, respectively. We do this so that we can now process each word of the corpus and, if needed, remove punctuation marks, numbers, etc., which are not required and are just a waste of memory. We have also used the function ngrams(), which returns combinations of consecutive words of the length we specify (3 in our case), though we can use any number.
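Conceptually, ngrams() just slides a fixed-size window over the token list. A minimal pure-Python sketch of the same idea (not NLTK’s implementation) looks like this:

```python
def sliding_ngrams(tokens, n):
    """Return consecutive n-token windows, like nltk.util.ngrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["natural", "language", "processing", "with", "python"]
# Each trigram overlaps the next by two tokens.
print(sliding_ngrams(tokens, 3))
```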
Stemming: NLTK Python
Stemming is the process of reduction carried out on words derived from the same root word. We often use many forms of the same word, like ‘lie’, ‘liar’, ‘lying’, etc., all having the same base or root, i.e. lie. Although these words carry the same value, our system will consider them as different, and thus they can be treated differently from one another. So we need to convert them to their root form, which is done by stemming.
Note that in stemming, the root word we get can be semantically incorrect. By this we mean that stemmed words may or may not be meaningful. For example, ‘Studies’ will get stemmed to Studi, which is semantically incorrect for us, while ‘Studying’ will get reduced to Study, which is a known word.
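To make the idea concrete, here is a deliberately naive suffix-stripping sketch (not the Porter algorithm, which applies far more careful rules). It shows both a useful reduction and the kind of semantically odd stem just described:

```python
def naive_stem(word):
    """Strip a few common suffixes; a crude illustration only."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.lower().endswith(suffix) and len(word) > len(suffix) + 2:
            return word.lower()[:-len(suffix)]
    return word.lower()

# "studies" -> "stud" (odd, like Porter's "studi"),
# "studying" -> "study", "connections" -> "connection"
for w in ["studies", "studying", "connections"]:
    print(w, "->", naive_stem(w))
```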
por_stem = PorterStemmer()
stemmed_words = [por_stem.stem(word) for word in words]
print(stemmed_words)
We have created an instance of PorterStemmer(), the most popular stemmer, and used it to build a list of all the tokenized words after stemming them.
Stopwords: NLTK Python
Stopwords are the words most frequently used while structuring our data; they do not add value to our sentences, and removing them is good practice when we have a large data size. They are present in almost every human language, and NLTK has a collection of these words for several languages. Some examples of Stopwords are ‘a‘, ‘any‘, ‘during‘, ‘few‘ and many more.
We can check the Stopwords with the following command, and do try to view the stopwords in other languages as well.
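With the stopwords corpus downloaded, the check is stopwords.words('english'), and stopwords.fileids() lists the available languages. To see the effect without the corpus, here is a sketch that filters with a small hand-picked subset of English stopwords standing in for the real list:

```python
# With NLTK's data installed, the real list comes from:
#   from nltk.corpus import stopwords
#   print(stopwords.words('english'))   # the English stopword list
#   print(stopwords.fileids())          # languages available
# Tiny hand-picked subset standing in for stopwords.words('english'):
sample_stopwords = {"a", "an", "the", "is", "any", "during", "few", "of"}

sentence = "The library is a complete toolkit for any kind of text processing"
tokens = sentence.lower().split()
content_words = [w for w in tokens if w not in sample_stopwords]
# The structural words drop out, leaving the content-bearing tokens.
print(content_words)
```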
Lemmatization: NLTK Python
Lemmatization is similar to Stemming, but here the base or root word produced is semantically correct or meaningful. It is useful when we are concerned with the semantics of the text we have. Note, however, that Lemmatization is slower than Stemming.
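The difference is easy to see with a toy example: a stemmer chops suffixes mechanically, while a lemmatizer looks words up (WordNetLemmatizer consults WordNet; here a small hand-written dictionary stands in for that lookup):

```python
# Toy lookup table standing in for a real lemma dictionary like WordNet.
toy_lemmas = {"studies": "study", "feet": "foot", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary lemma if known, else the word unchanged."""
    return toy_lemmas.get(word, word)

def chop_stem(word):
    """Mechanical suffix chop, standing in for a stemmer."""
    return word[:-2] if word.endswith("es") else word

# The stemmer mangles "studies" and cannot handle irregular forms
# like "feet" or "better"; the dictionary lookup can.
for w in ["studies", "feet", "better"]:
    print(w, "| stem:", chop_stem(w), "| lemma:", toy_lemmatize(w))
```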
word_lemma = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not per word
Lemmatized_words = [word_lemma.lemmatize(word).lower()
                    for word in words
                    if word.isalpha() and word.lower() not in stop_words]
print(Lemmatized_words)
To understand the code above, we recommend that you know about list comprehension. You can read this tutorial on list comprehension.
First, we lemmatize each word present in ‘words’, applying the conditions that each word must consist of alphabetic characters only (using word.isalpha()) and must not be present in the Stopwords.
Run the code and see the difference between Stemmed words and Lemmatized words.
Part-Of-Speech Tagging in NLTK with Python
This section teaches us how to find out which POS category each word falls under.
pos = pos_tag(Lemmatized_words)
print(pos)
The above code gives us an output in which each word is paired with its POS category, like JJ, NN, VBZ, VBG, and many more. To learn what these tags represent, just run the following command.
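The command in question is nltk.help.upenn_tagset(), which prints the Penn Treebank tag definitions (pass a tag name like 'NN' to see just one). As a quick reference, a few of the common tags are listed in a plain dictionary here for illustration:

```python
# With NLTK's data installed, the full list comes from:
#   import nltk
#   nltk.help.upenn_tagset()        # all tags with definitions and examples
#   nltk.help.upenn_tagset("NN")    # a single tag
# A few common Penn Treebank tags for quick reference:
common_tags = {
    "NN": "noun, singular or mass",
    "JJ": "adjective",
    "VBZ": "verb, 3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "RB": "adverb",
}
for tag, meaning in common_tags.items():
    print(tag, "-", meaning)
```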
That’s all for this tutorial. We hope you really enjoyed it, and feel free to comment below if you have any doubts.