tf-idf Model for Page Ranking in Python

tf-idf stands for term frequency-inverse document frequency. It is a weighting scheme that scores each term in a document by how often the term occurs in that document, offset by how common the term is across the corpus. The resulting weight vectors are used in information retrieval and text mining, so the tf-idf matrix estimates how important each word is to each document in the corpus. Search engines use weighting schemes of this kind to score, rank, and retrieve relevant documents and display the results.
This model has two components:
-> TF (Term Frequency)
-> IDF (Inverse Document Frequency)


Let’s go step by step.

We are creating two documents for simplicity.

docA = "The car is driven on the road"
docB = "The truck is driven on the highway"

Next, we tokenize each document. Tokenization is the first step in preprocessing text: it splits a document into a list of tokens.

from nltk.tokenize import word_tokenize
tokens1 = word_tokenize(docA)
tokens2 = word_tokenize(docB)
tokens1, tokens2
(['The', 'car', 'is', 'driven', 'on', 'the', 'road'],
 ['The', 'truck', 'is', 'driven', 'on', 'the', 'highway'])
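
Note that the token lists above keep 'The' and 'the' as separate tokens, so they are counted as different words in the later steps. The following is a minimal sketch of case-insensitive tokenization, assuming that is the desired behavior. The helper name `simple_tokenize` is hypothetical, and it uses a plain regular expression rather than NLTK, so it is not a drop-in replacement for `word_tokenize`.

```python
import re

def simple_tokenize(text):
    # Lowercase the text, then keep runs of ASCII letters as tokens.
    # This is an illustrative assumption, not the tokenizer used in this article.
    return re.findall(r"[a-z]+", text.lower())

tokens_lower = simple_tokenize("The car is driven on the road")
print(tokens_lower)
```

With this variant, 'the' would appear twice in each document instead of 'The' and 'the' appearing once each.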

Next, we build the vocabulary (the set of all words across both documents) and define a function that counts how often each word occurs in a document. The function returns both the raw term counts and the normalized term frequencies.

wordset = set(tokens1).union(set(tokens2))

def computeTF(doc):
    raw_tf = dict.fromkeys(wordset, 0)      # start every vocabulary word at 0
    norm_tf = {}
    bow = len(doc)                          # total number of tokens in the document
    for word in doc:
        raw_tf[word] += 1                   # raw term count
    for word, count in raw_tf.items():
        norm_tf[word] = count / float(bow)  # normalized term frequency
    return raw_tf, norm_tf

The first step in our tf-idf model is to calculate the term frequency (TF) for each document in the corpus. The corpus is the collection of all the documents.

Term Frequency: the number of times a word occurs in a document. The normalized term frequency is the ratio of that count to the total number of words in the document.
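
As a quick check of the definition, here is a minimal sketch that computes the normalized term frequency for the tokens of docA. The helper name `term_frequency` is hypothetical, and unlike `computeTF` it only returns words that actually occur in the document rather than the whole vocabulary.

```python
from collections import Counter

def term_frequency(tokens):
    # Count each token, then divide by the total number of tokens.
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tokens_a = ['The', 'car', 'is', 'driven', 'on', 'the', 'road']
tf_a = term_frequency(tokens_a)
print(tf_a['car'])  # 1 / 7, roughly 0.143
```

Each of the seven tokens occurs once, so every word gets the same weight of 1/7.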

tf_dictA, norm_tf_dictA = computeTF(tokens1)

tf_dictB, norm_tf_dictB = computeTF(tokens2)
print('Term Frequency for doc1\n')
print(tf_dictA)
print('\n Normalized tf\n')
print(norm_tf_dictA)
Term Frequency for doc1

{'highway': 0, 'driven': 1, 'The': 1, 'the': 1, 'road': 1, 'truck': 0, 'is': 1, 'car': 1, 'on': 1}

 Normalized tf

{'highway': 0.0, 'driven': 0.14285714285714285, 'The': 0.14285714285714285, 'the': 0.14285714285714285, 'road': 0.14285714285714285, 'truck': 0.0, 'is': 0.14285714285714285, 'car': 0.14285714285714285, 'on': 0.14285714285714285}

The second step is to compute the inverse document frequency.

Inverse Document Frequency (IDF): TF measures how often a word occurs in a single document, so a higher frequency suggests a more important word. On its own, however, TF does not account for words that occur in almost every document and therefore say little about any one of them. IDF attenuates this effect by scaling a term's weight down according to how many documents in the collection contain it. Here it is computed as log10(N / df), where N is the number of documents and df is the number of documents that contain the term, so terms that occur in few documents receive higher weights.
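
To make the attenuation concrete, here is a small sketch of the IDF value for a two-document corpus, using the base-10 logarithm of the number of documents divided by the number of documents containing the term. The function name `inverse_document_frequency` is hypothetical and chosen only for illustration.

```python
import math

def inverse_document_frequency(num_docs, doc_freq):
    # num_docs: total documents in the corpus
    # doc_freq: number of documents that contain the term (must be > 0)
    return math.log10(num_docs / doc_freq)

print(inverse_document_frequency(2, 2))  # term in both documents: 0.0
print(inverse_document_frequency(2, 1))  # term in one document: roughly 0.301
```

A word that appears in every document gets an IDF of zero, so it contributes nothing to the final tf-idf weight no matter how often it occurs.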

import math

def computeIdf(doclist):
    # doclist is a list of raw term-count dicts that share the same keys
    idf = dict.fromkeys(doclist[0].keys(), float(0))
    for doc in doclist:
        for word, val in doc.items():
            if val > 0:
                idf[word] += 1   # number of documents containing the word
    for word, val in idf.items():
        idf[word] = math.log10(len(doclist) / float(val))   # log10(N / df)
    return idf
idf = computeIdf([tf_dictA, tf_dictB])
idf

{'highway': 0.3010299956639812,
 'driven': 0.0,
 'The': 0.0,
 'the': 0.0,
 'road': 0.3010299956639812,
 'truck': 0.3010299956639812,
 'is': 0.0,
 'car': 0.3010299956639812,
 'on': 0.0}

Finally, compute the tf-idf weight of every term in each document.

def computeTfidf(norm_tf,idf):
    tfidf = {}
    for word , val in norm_tf.items():
        tfidf[word] = val*idf[word]
    return tfidf
tfidfA = computeTfidf(norm_tf_dictA,idf)
tfidfB = computeTfidf(norm_tf_dictB,idf)
tfidfA

{'highway': 0.0,
 'driven': 0.0,
 'The': 0.0,
 'the': 0.0,
 'road': 0.043004285094854454,
 'truck': 0.0,
 'is': 0.0,
 'car': 0.043004285094854454,
 'on': 0.0}

The tf-idf weights are now ready to be used for page ranking or other scoring techniques in information retrieval.
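
As one illustration of such scoring, the sketch below ranks the two documents against a query by summing the tf-idf weights of the query terms. This is only one simple scoring choice (cosine similarity is a common alternative), and the names `rank` and `tfidf_docs` are hypothetical. The weights are recomputed here as (1/7) * log10(2) so the block runs on its own, and zero-weight terms are left out of the dictionaries.

```python
import math

# Weight of a term that occurs once in a 7-token document
# and appears in only one of the two documents.
w = (1 / 7) * math.log10(2)

tfidf_docs = {
    "docA": {"car": w, "road": w},
    "docB": {"truck": w, "highway": w},
}

def rank(query_tokens, tfidf_docs):
    # Score each document by summing the weights of the query terms it contains.
    scores = {
        name: sum(weights.get(token, 0.0) for token in query_tokens)
        for name, weights in tfidf_docs.items()
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank(["truck", "on", "the", "highway"], tfidf_docs)
print(ranking[0][0])  # docB
```

Words such as "on" and "the" appear in both documents, so their weight is zero and they do not affect the ranking.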
