Fuzzy String Matching in Python
Google defines fuzzy as difficult to perceive, indistinct, or vague. Boolean logic simply answers whether two strings are the same or not; fuzzy matching answers how similar they are. In this tutorial, we will learn approximate string matching, also known as fuzzy string matching, in Python.
Levenshtein distance, also known as edit distance, measures the difference between two sequences. For example, the Levenshtein distance between 'hello' and 'belly' is 2: first substitute 'b' for 'h', then 'y' for 'o'. Two is the minimum number of edits needed to transform one string into the other. The maximum possible distance is the length of the longer string, and the distance is zero when the strings are identical.
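The edit distance described above can be computed with a short dynamic-programming routine. Here is a minimal sketch (the function name levenshtein is ours, not part of any library):

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s into t, computed row by row."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances from '' to each prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost)  # substitute (or keep)
        prev = curr
    return prev[n]

print(levenshtein('hello', 'belly'))  # 2: substitute 'b' and 'y'
print(levenshtein('hello', 'hello'))  # 0: identical strings
```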
Fuzzy string matching applies Levenshtein distance through an easy-to-use package known as Fuzzywuzzy.
Not only does this package have a cute name, but it also comes in very handy for fuzzy string matching. Install it using:
pip install fuzzywuzzy
There are two main modules in this package: fuzz and process. You can import them in your code.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
The fuzz module has the following methods:
- ratio
- partial_ratio
- token_sort_ratio
- token_set_ratio

The process module has these functions:
- extract
- extractOne

Let us implement them in Python one by one.
fuzz.ratio
The ratio method simply uses Levenshtein distance to measure the difference between two strings. Here is an example:
>>> from fuzzywuzzy import fuzz
>>> str1 = 'hello'
>>> str2 = 'hellow'
>>> ratio = fuzz.ratio(str1, str2)
>>> print(ratio)
91
The second string looks like a simple misspelling of the first, and both mean the same thing, so the simple ratio is convenient here. Since only one edit (inserting 'w') is required, the similarity score is 91.
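For intuition, a score very close to fuzz.ratio can be reproduced with difflib from the standard library. This is a sketch of the idea, not fuzzywuzzy's exact backend (which may use python-Levenshtein when installed):

```python
from difflib import SequenceMatcher

# Similarity as 2 * matching characters / total characters, scaled to 0-100.
score = round(SequenceMatcher(None, 'hello', 'hellow').ratio() * 100)
print(score)  # 91, matching fuzz.ratio above
```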
fuzz.partial_ratio
Partial ratio is used for comparing strings that are a little more complex than the previous pair. Let us see an example:
>>> str1 = 'hello how are you'
>>> str2 = 'how are you'
>>> partial_ratio = fuzz.partial_ratio(str1, str2)
>>> ratio = fuzz.ratio(str1, str2)
>>> print(partial_ratio)
100
>>> print(ratio)
79
The algorithm compares the shorter string against substrings of the same length within the longer string and checks whether the shorter one appears inside the longer one. Thus, we got a score of 100 for partial ratio but only 79 for simple ratio: partial ratio recognizes that both strings ask the same question.
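This sliding-window idea can be sketched as follows: compare the shorter string against every same-length substring of the longer one and keep the best score. (The real implementation aligns matching blocks rather than scanning every window, so scores can differ slightly; partial_ratio_sketch is our own name.)

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    # Slide a window the size of the shorter string across the longer
    # string and keep the best simple-ratio score.
    shorter, longer = sorted((s1, s2), key=len)
    best = 0.0
    for i in range(len(longer) - len(shorter) + 1):
        window = longer[i:i + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)

print(partial_ratio_sketch('hello how are you', 'how are you'))  # 100
```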
fuzz.token_sort_ratio
Sometimes the words in a string are merely rearranged but the meaning stays the same. Token sort ratio helps in such cases:
>>> str1 = 'hobbs and shaw'
>>> str2 = 'shaw and hobbs'
>>> token_sort_ratio = fuzz.token_sort_ratio(str1, str2)
>>> ratio = fuzz.ratio(str1, str2)
>>> partial_ratio = fuzz.partial_ratio(str1, str2)
>>> print(token_sort_ratio)
100
>>> print(ratio)
36
>>> print(partial_ratio)
53
Here you can see that rearranging the words did not affect the token sort ratio, so the score is 100, whereas simple ratio and partial ratio give much lower scores. In this method, the algorithm first sorts each string's tokens alphabetically and then joins them back together (a token is a sequence of characters between spaces). A simple ratio is then applied to the sorted strings to determine the similarity score.
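The sort-then-compare step is easy to reproduce. Here is a minimal sketch using difflib in place of fuzz.ratio (token_sort_ratio_sketch is our own name):

```python
from difflib import SequenceMatcher

def token_sort_ratio_sketch(s1, s2):
    # Sort each string's tokens alphabetically, rejoin them,
    # then apply a simple ratio to the sorted strings.
    sorted1 = ' '.join(sorted(s1.split()))
    sorted2 = ' '.join(sorted(s2.split()))
    return round(SequenceMatcher(None, sorted1, sorted2).ratio() * 100)

print(token_sort_ratio_sketch('hobbs and shaw', 'shaw and hobbs'))  # 100
```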
fuzz.token_set_ratio
Even with all the above ratios, it can still be hard to determine the similarity between two strings. When tokens are rearranged, repeated, or vary in number, the comparison becomes much more complex. That is where token set ratio comes into the picture. Here is an example:
>>> str1 = 'fuzzy wuzzy was a cute bear'
>>> str2 = 'wuzzy fuzzy fuzzy was a bear with white furs'
>>> token_set_ratio = fuzz.token_set_ratio(str1, str2)
>>> token_sort_ratio = fuzz.token_sort_ratio(str1, str2)
>>> partial_ratio = fuzz.partial_ratio(str1, str2)
>>> ratio = fuzz.ratio(str1, str2)
>>> print(token_set_ratio)
90
>>> print(token_sort_ratio)
68
>>> print(partial_ratio)
78
>>> print(ratio)
59
Token set ratio is much more flexible than token sort ratio. Instead of just sorting and applying a simple ratio, the algorithm builds these strings from the token sets:
- [sorted_intersection]
- [sorted_intersection] + [sorted_rest_of_strings_in_str1]
- [sorted_intersection] + [sorted_rest_of_strings_in_str2]
and then each pair is compared using a simple ratio, keeping the highest score. Here, the sorted intersection means the tokens common to both strings, sorted alphabetically, and the sorted rest of the strings means each string's remaining tokens. Let us illustrate this with the example above:
>>> a = 'a bear fuzzy was wuzzy'  # sorted_intersection
>>> b = 'a bear fuzzy was wuzzy cute'  # sorted_intersection + sorted_rest of str1
>>> c = 'a bear fuzzy was wuzzy furs fuzzy white with'  # sorted_intersection + sorted_rest of str2
>>> fuzz.ratio(a, b)
90
>>> fuzz.ratio(a, c)
67
>>> fuzz.ratio(b, c)
73
Hence, the token set ratio gives a similarity score of 90, the highest of the three comparisons.
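The whole token-set procedure can be sketched as follows: build the three strings from token sets, compare every pair with a simple ratio, and return the best score. (This sketch uses difflib rather than fuzzywuzzy's internals and deduplicates tokens via sets, so scores may differ slightly from the library; token_set_ratio_sketch is our own name.)

```python
from difflib import SequenceMatcher

def _simple_ratio(a, b):
    return round(SequenceMatcher(None, a, b).ratio() * 100)

def token_set_ratio_sketch(s1, s2):
    # Split into token sets, build the intersection and the two
    # "intersection + leftovers" strings, and keep the best pairwise score.
    t1, t2 = set(s1.split()), set(s2.split())
    inter = ' '.join(sorted(t1 & t2))
    combined1 = (inter + ' ' + ' '.join(sorted(t1 - t2))).strip()
    combined2 = (inter + ' ' + ' '.join(sorted(t2 - t1))).strip()
    return max(_simple_ratio(inter, combined1),
               _simple_ratio(inter, combined2),
               _simple_ratio(combined1, combined2))

print(token_set_ratio_sketch('fuzzy wuzzy was a cute bear',
                             'wuzzy fuzzy fuzzy was a bear with white furs'))  # 90
```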
process.extract & process.extractOne
process is a module in fuzzywuzzy that extracts the most similar choices out of all the available options. We can find the similarity score of each option using the extract() method. Let us check this out with an example:
>>> from fuzzywuzzy import process
>>> options = ['white flower', 'pink dress', 'teddy bear', 'pink flower']
>>> find = 'flower'
>>> process.extract(find, options)
[('white flower', 90), ('pink flower', 90), ('teddy bear', 30), ('pink dress', 15)]
>>> process.extract(find, options, limit=2)
[('white flower', 90), ('pink flower', 90)]
>>> process.extractOne(find, options)
('white flower', 90)
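To see roughly what process.extract is doing, here is a sketch: score every option against the query and return the best matches sorted by score. (The real extract defaults to a weighted scorer, fuzz.WRatio, so the exact numbers differ; extract_sketch and its simple difflib scorer are our own.)

```python
from difflib import SequenceMatcher

def extract_sketch(query, choices, limit=5):
    # Score every choice against the query, then return the
    # top `limit` (choice, score) pairs, best first.
    scored = [(choice, round(SequenceMatcher(None, query, choice).ratio() * 100))
              for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

options = ['white flower', 'pink dress', 'teddy bear', 'pink flower']
print(extract_sketch('flower', options, limit=2))
# [('pink flower', 71), ('white flower', 67)]
```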
Fuzzy string matching has many applications in computer science and other fields, such as spell checking, DNA sequence matching, and spam filtering.
Thus, we have learned how to determine the similarity between two strings and how to extract the most similar choices from the available options. Along the way, we learned about the Fuzzywuzzy library, its modules fuzz and process, and the important functions available in each module.
This is all about Fuzzy String Matching in Python.