Word Collocations in NLP
In this tutorial, we will learn about collocations in Natural Language Processing with a Python program.
We will start this tutorial with a question: what is a collocation?
A collocation is an expression of multiple words that frequently occur together in a corpus.
Let's say we have a collection of texts (called a corpus; plural corpora) related to Machine Learning. We will see that ('machine', 'learning') and ('artificial', 'intelligence') appear together frequently, i.e. they are highly collocated.
Mostly we use bigram and trigram collocations for our filtering. We also try to keep only terms that frequently and meaningfully occur together, which makes for a better selection.
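To make the idea of bigrams and trigrams concrete before we bring in nltk, here is a minimal pure-Python sketch that extracts adjacent word pairs and triples from a token list. The token list is a made-up example, not the tutorial's corpus:

```python
# Hypothetical token list for illustration only.
tokens = ["machine", "learning", "is", "a", "field", "of", "machine", "learning"]

# Adjacent pairs: zip the list with itself shifted by one position.
bigrams = list(zip(tokens, tokens[1:]))

# Adjacent triples: shift by one and by two positions.
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

print(bigrams[0])   # ('machine', 'learning')
print(trigrams[0])  # ('machine', 'learning', 'is')
```

Collocation finders count how often each such pair or triple occurs and then rank them by an association score.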
Methods for generating Bigrams
Prerequisites: basic nltk and Python
Importing required Libraries
import nltk
import nltk.collocations
import nltk.corpus
import collections
Let's say we have a small collection of words (see the first paragraph of this page) saved as example.txt. We will use likelihood ratios (hypothesis testing) to find bigrams and trigrams.
Likelihood ratio: a number that tells us how much more likely one hypothesis is than the other.
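As a rough sketch of what this score measures: for each bigram, the counts can be arranged in a 2x2 contingency table (how often the two words occur together vs. apart), and the log-likelihood ratio statistic G² compares the observed counts with the counts expected if the words were independent. The hand-rolled version below is an illustration of the idea, not nltk's exact implementation:

```python
import math

def log_likelihood_ratio(n_ii, n_io, n_oi, n_oo):
    """G^2 statistic for a 2x2 contingency table of bigram counts.

    n_ii: count(w1 followed by w2)
    n_io: count(w1 followed by anything but w2)
    n_oi: count(anything but w1 followed by w2)
    n_oo: count of all remaining bigrams
    """
    n = n_ii + n_io + n_oi + n_oo
    # Expected cell counts under the independence hypothesis:
    # (row total * column total) / grand total.
    expected = [
        (n_ii + n_io) * (n_ii + n_oi) / n,
        (n_ii + n_io) * (n_io + n_oo) / n,
        (n_oi + n_oo) * (n_ii + n_oi) / n,
        (n_oi + n_oo) * (n_io + n_oo) / n,
    ]
    observed = [n_ii, n_io, n_oi, n_oo]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Words that co-occur far more often than chance get a high score ...
print(log_likelihood_ratio(10, 5, 5, 80))  # ~27.41
# ... while counts that exactly match independence score zero.
print(log_likelihood_ratio(1, 9, 9, 81))   # 0.0
```

Higher scores therefore indicate pairs that are more strongly collocated.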
webtext = nltk.corpus.webtext
bigrams = nltk.collocations.BigramAssocMeasures()
words = [w.lower() for w in webtext.words('./example.txt')]
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(words)
bigramratio = bigramFinder.score_ngrams(bigrams.likelihood_ratio)
bigramratio[:9]
[(('machine', 'learning'), 27.06904985133383),
 (('it', 'is'), 15.868742980364372),
 (('"', 'training'), 11.240109508056893),
 (('(', 'ml'), 11.240109508056893),
 (('1', ']['), 11.240109508056893),
 (('][', '2'), 11.240109508056893),
 (('are', 'used'), 11.240109508056893),
 (('artificial', 'intelligence'), 11.240109508056893),
 (('automatically', 'through'), 11.240109508056893)]
webtext: a type of PlaintextCorpusReader for corpora that consist of plaintext documents. Paragraphs are assumed to be split by blank lines.
We use BigramAssocMeasures to compute the likelihood ratio between terms.
We get a list of all words from our example.txt file, in lower case.
In the output we can see pairs of words together with their corresponding likelihood ratios.
However, the output also contains many meaningless terms such as '][', '2', 'is', 'are', and '(', so we need to clean our data to get meaningful terms.
Cleaning stopwords, numbers, and punctuation
Stopwords are English words that do not add much meaning to a sentence, for example 'should', 'because', 'from', 'again', 'nor', and many more.
These meaningless words are usually filtered out before processing the data. And since we want to process only text, we don't need numbers or punctuation either.
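Before handing this filtering condition to nltk below, here is what it does on its own in plain Python. The tiny stopword set and token list are hypothetical examples (nltk's full English stopword list is much larger):

```python
# Hypothetical tiny stopword set for illustration.
stopwords = {"is", "are", "the", "a", "an", "of"}

tokens = ["machine", "learning", "is", "a", "subset", "of", "ai", "][", "2", "("]

# Keep only purely alphabetic tokens that are not stopwords -- the same
# condition we pass to apply_word_filter below, just inverted for keeping.
kept = [w for w in tokens if w.isalpha() and w not in stopwords]
print(kept)  # ['machine', 'learning', 'subset', 'ai']
```

str.isalpha() is False for numbers and punctuation, so those tokens are dropped along with the stopwords.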
stopword = nltk.corpus.stopwords
stopwords = set(stopword.words('english'))
# stopwords  # uncomment to list all English stopwords
bigramFinder.apply_word_filter(lambda w: not w.isalpha() or w in stopwords)
bigramratio = bigramFinder.score_ngrams(bigrams.likelihood_ratio)
bigramratio[:5]
[(('machine', 'learning'), 27.06904985133383),
 (('artificial', 'intelligence'), 11.240109508056893),
 (('decisions', 'without'), 11.240109508056893),
 (('develop', 'conventional'), 11.240109508056893),
 (('email', 'filtering'), 11.240109508056893)]
Here, we can see that we get very meaningful bigrams as output. From this, we can tell which phrases frequently occur together.
Just as with bigrams, we can generate trigrams: three terms that frequently occur together, ranked by their likelihood ratio.
Just like we cleaned the bigrams, we will also clean all unwanted terms from our corpus. Then we will find trigrams in our cleaned data.
trigrams = nltk.collocations.TrigramAssocMeasures()
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(words)
trigramFinder.apply_word_filter(lambda w: not w.isalpha() or w in stopwords)
trigramratio = trigramFinder.score_ngrams(trigrams.likelihood_ratio)
trigramratio[:3]
[(('machine', 'learning', 'algorithms'), 56.14983940261841),
 (('mathematical', 'model', 'based'), 33.72032852417066),
 (('learning', 'algorithms', 'build'), 28.247545804343332)]
If you have any query or feel anything is wrong, you can comment below.
Thanks for reading 🙂