Similarity metrics of strings in Python

In this tutorial, we’ll learn about similarity metrics for strings in Python.

String similarity is used in many areas of computer science, such as natural language processing, machine learning, and web development.
First, we’ll learn how to measure the similarity between two sentences, and then we’ll move on to generating a similarity matrix for multiple strings.
The methods we’ll explore in this tutorial are:

  1. Levenshtein distance method
  2. Sum and Zip methods
  3. SequenceMatcher.ratio() method
  4. Cosine similarity method

Using the Levenshtein distance method in Python

The Levenshtein distance between two words is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
First, we’ll install the python-Levenshtein package using the following command:

pip install python-Levenshtein

Then import it:

import Levenshtein

Now, we’ll use the distance function to calculate the Levenshtein distance as follows:

Levenshtein.distance("Hello World", "Hllo World")

Its corresponding output is as follows:

1

The result is 1 because a single edit is enough: inserting ‘e’ into ‘Hllo’ turns it into ‘Hello’ (equivalently, deleting ‘e’ from ‘Hello’ gives ‘Hllo’).
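
To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming approach. The function name levenshtein is a hypothetical helper written for this illustration, not part of the python-Levenshtein package, and it will be much slower than the library on long strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein distance between a and b (illustrative helper)."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or a free match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Hello World", "Hllo World"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.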

Using the sum and zip method in Python

The built-in zip function pairs up the elements at the same index of several iterables so that we can iterate over them together.
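
As a quick illustration of that pairing behaviour, the sketch below zips a string with a list. The variable names are made up for this example.

```python
letters = "abc"
numbers = [1, 2, 3, 4]

# zip yields one tuple per index and stops at the shorter input,
# so the trailing 4 is silently dropped.
pairs = list(zip(letters, numbers))
print(pairs)  # [('a', 1), ('b', 2), ('c', 3)]
```

Because zip stops at the shorter input, strings of different lengths have to be brought to the same length before they are compared index by index.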
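First, we’ll initialize two strings and pad the shorter one with spaces so that both have the same length.

s1 = 'Hello World'
s2 = 'Hello Word'
# Pad the shorter string with spaces; a negative repeat count yields '',
# so the longer string is left unchanged.
s1 = s1 + ' ' * (len(s2) - len(s1))
s2 = s2 + ' ' * (len(s1) - len(s2))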

Now, initialize a counter to 0.
After zipping the two strings, we’ll compare the characters at each index and increase the counter by 1 whenever they are the same. Finally, we’ll divide the counter by the length of the first string and print the result.

# Named matches rather than sum so that the built-in sum() is not shadowed.
matches = 0
for i, j in zip(s1, s2):
  if i == j:
    matches += 1
similarity = matches / float(len(s1))

print("Similarity between two strings is: " + str(similarity))

Its corresponding output is as follows:

Similarity between two strings is: 0.8181818181818182
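
The same calculation can also be written more compactly with the built-in sum function. The sketch below is one possible variant, not the only way to do it: positional_similarity is a hypothetical helper name, and it uses itertools.zip_longest in place of the manual padding.

```python
from itertools import zip_longest


def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions at which a and b hold the same character."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    # zip_longest pads the shorter string with None, which never equals a character.
    matches = sum(x == y for x, y in zip_longest(a, b))
    return matches / longest


print(positional_similarity("Hello World", "Hello Word"))
```

Note that this metric is sensitive to position: a single extra character at the start of one string shifts every later character and can push the score close to 0, even though the strings are nearly identical.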

Using SequenceMatcher.ratio() method in Python

SequenceMatcher is a class in the standard difflib module. We simply pass it both strings, and its ratio() method returns a similarity score between 0 and 1.
First, we’ll import SequenceMatcher:

from difflib import SequenceMatcher

Now, we’ll initialize the two strings, pass them to the SequenceMatcher constructor, and print the result of ratio().

s1 = "I am fine"
s2 = "I are fine"
sim = SequenceMatcher(None, s1, s2).ratio()
print("Similarity between two strings is: " + str(sim) )

Its corresponding output is as follows:

Similarity between two strings is: 0.8421052631578947
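
To see where this number comes from: the difflib documentation defines ratio() as 2.0 * M / T, where M is the number of matched characters and T is the total length of both strings. The sketch below recomputes it from get_matching_blocks(); the variable names are chosen for this illustration.

```python
from difflib import SequenceMatcher

s1 = "I am fine"
s2 = "I are fine"
matcher = SequenceMatcher(None, s1, s2)

# Each block is (start in s1, start in s2, size); the last one is a zero-size sentinel.
blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
for b in blocks:
    print(repr(s1[b.a:b.a + b.size]))

matched = sum(b.size for b in blocks)  # M: characters covered by matching blocks
total = len(s1) + len(s2)              # T: combined length of both strings
print(2 * matched / total)
print(matcher.ratio())
```

For these two strings, 8 characters are matched out of a combined length of 19, which gives 16/19, the value printed above.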

Using Cosine similarity in Python

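We’ll construct a vector space from all the input sentences. The number of dimensions in this vector space is the number of unique words in all sentences combined, and each sentence becomes a vector of word counts. Then we’ll calculate the cosine of the angle between each pair of vectors: 1 means the vectors point in the same direction, and 0 means the sentences share no words.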
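
Before bringing in any libraries, the idea can be sketched with the standard library alone. The cosine helper below is hypothetical and assumes the sentences are already lowercase and free of punctuation; it computes the dot product of the two word-count vectors divided by the product of their lengths.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity of two sentences using raw word counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    # Only words present in both sentences contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


print(cosine("hello world", "hello word"))
```

Here the two sentences share one of their two words each, so the result is approximately 0.5 (floating-point rounding may show a value a hair below it).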

We’ll remove punctuation from the strings using the string module, since ‘Hello!’ and ‘Hello’ should be treated as the same word. Very frequent words that carry little meaning, such as ‘I’, ‘you’, and ‘myself’, will also be removed; these are known as stopwords. The cleaned strings will then be converted to numerical vectors using CountVectorizer.

So, first, we download the stopword list. This only needs to be done once:

import nltk
nltk.download("stopwords")

Then we import the following packages and load the English stopwords:

import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
# Note: this rebinds the name stopwords from the corpus reader to a plain list of words.
stopwords = stopwords.words("english")

Now, we’ll define the input sentences.

text = [
    "Hello World.",
    "Hello Word",
    "Another hello world.",
    "Welcome! in this new world.",
]

We’ll clean the text by removing punctuation, converting it to lowercase, and removing stopwords.

def clean_data(text):
  text = ''.join([ele for ele in text if ele not in string.punctuation])
  text = text.lower()
  text = ' '.join([ele for ele in text.split() if ele not in stopwords])
  return text

Now, instead of calling the above function for each sentence, let’s use the map function.

data = list(map(clean_data, text))

After cleaning, the data is as follows:

['hello world', 'hello word', 'another hello world', 'welcome new world']

Now, we’ll use CountVectorizer to convert the data into vectors.

# CountVectorizer takes no data in its constructor; the data is passed to fit/transform.
vectorizer = CountVectorizer()
vectorizer.fit(data)
vectors = vectorizer.transform(data).toarray()

Finally, we’ll use the cosine_similarity function to compute the similarity between every pair of sentences.

cos_sim = cosine_similarity(vectors)
print(cos_sim)

Its corresponding output is as follows:

[[1.         0.5        0.81649658 0.40824829]
 [0.5        1.         0.40824829 0.        ]
 [0.81649658 0.40824829 1.         0.33333333]
 [0.40824829 0.         0.33333333 1.        ]]

Entry (i, j) of this matrix is the similarity between sentence i and sentence j. You may notice the diagonal elements are always 1 because every sentence is identical to itself.
I hope you enjoyed this tutorial.

 
