Pattern Search in String with Rabin-Karp Algorithm in Python
Searching for a string pattern in the main string using the naive algorithm may be time-consuming and inefficient. The following post deals with a slightly more efficient method – the Rabin-Karp Algorithm, to perform the same task. It will finally return the starting indices of all the matches found.
Prerequisites: Basics of python strings, the naive algorithm (<please add my naive algorithm pattern search post’s internal link here>)
Rabin-Karp Algorithm
The Rabin-Karp algorithm provides a cut down on the number of substrings we match character by character in case of the naive algorithm. It does so by providing by filtering the substrings on a different basis first. It finds the hash values of a substring before comparing it character by character.
What is Hashing?
Hashing basically means converting one form of data into another. The most common way is converting strings into an integer representation. These methods are very commonly used in cryptography, compression, password authentication, etc.
Hashing in Rabin-Karp
In this algorithm, we use hashing to convert each substring to an equivalent integer representation. The hashing method we adopt here is Rabin-Karp rolling hash method.
The hash function may be defined as,
hash(string[m,m+1,….n-1,n]) = {string[m]*(p^(n-1)) + string[m+1]*(p^(n-2)) + ……. + string[n]*(p^0)} mod(pn)
where, m – beginning of substring in main string
n – end of substring in main string
p – some constant
pn – some prime number
The constant chosen can be usually arbitrary ensuring, however, that it is large enough to accommodate all possible characters in the string distinctly. We choose 26 in this implementation since there are 26 alphabets. When looked at carefully, it is basically like converting a number of base 26 into a decimal, i.e we multiply each digit with the base raised to its place value (refer to this for more).
We perform a mod operation using an arbitrary prime number simply to avoid over utilisation of memory for very large substrings. This, however, may cause different substrings to have the same hash value on some occasions. If the prime number is kept sufficiently long, this won’t happen too often and won’t affect the performance of the algorithm significantly.
Faster Hashing
It is important to note here, that if we keep finding the hash value at each iteration using the above method, it is as good as comparing the entire string. That’s because we have to iterate through the entire string in both cases! Hence, once the hash value is found for the first substring, the next hashes can be found using the previous hash. The following formula is used for this:
new_hash = {p*[prev_hash – string[m]*(p^(n-1))] + [string(n+1)*(p^0)]mod(pn)}
This formula is merely mathematical. It removes the first digit, multiplies the number by the place value and adds the last character of the new substring (the only new character in the substring). This can be shown using a decimal number, say 267. 267-(2*100) = 67. Then, 67*10 = 670. Now if the new digit is, say 8, then 67+8 = 678. Hence we removed 2 from 267 from the beginning and added 8 at the end.
Back to Rabin-Karp Algorithm
So we find the hash value for each substring and check for character-wise matching only if the hash values match. That is, the pattern and the substring have the same hash value. This helps us cut down on a large number of iterations, without having to compare entire substrings.
Rabin-Karp Algorithm in Python
Consider the following program,
def rk_search(string,pat,lconst): #lconst is the large constant used to limit the maximum hash value string = string.upper() pat = pat.upper() #ASSUMING ALL CHARACTERS ARE UPPPER_CASE, #Can be extended for lower case if necessary l = len(string) l_p = len(pat) con = 26 #The constant for base system 26 hashval = 0 #For the pattern currhash = 0 #For each substring for i in range(l_p): hashval += ((ord(pat[i])-ord('A')+1)*(con**(l_p-i-1)))%lconst currhash += ((ord(string[i])-ord('A')+1)*(con**(l_p-i-1)))%lconst for ind in range(l-l_p+1): if ind!=0: currhash = (con*(currhash-((ord(string[ind-1])-ord('A')+1)*(con**(l_p-1))))+((ord(string[ind+l_p-1])-ord('A')+1))%lconst) if(currhash==hashval): i,j = 1,ind+1 while(i<l_p): if string[j]!=pat[i]: break i += 1 j += 1 else: print "Found at index",ind
This is the complete implementation of the said logic.
hashval is calculated for the pattern and currhash is calculated for each substring in the iteration (except for the first one, for which the long method is used). Note that we are considering A=1, B=2……Z=26. Whenever the hash values match for the pattern and substring, we are comparing, we check character-wise and find out if the substring is present.
Small-scale Implementation
In case we only have small substrings and memory is not a major issue, we can ignore the mod part of the hashing. In this case, the hash values will always be unique and it is sufficient to check only the hash values of the pattern and substring. If they are equal, the pattern is found. The program is modified for this case below,
def rk_search(string,pat): string = string.upper() pat = pat.upper() #ASSUMING ALL CHARACTERS ARE UPPPER_CASE, #Can be extended for lower case if necessary l = len(string) l_p = len(pat) con = 26 #The constant for base system 26 hashval = 0 #For the pattern currhash = 0 #For each substring for i in range(l_p): hashval += (ord(pat[i])-ord('A')+1)*(con**(l_p-i-1)) currhash += (ord(string[i])-ord('A')+1)*(con**(l_p-i-1)) for ind in range(l-l_p+1): if ind!=0: currhash = con*(currhash-((ord(string[ind-1])-ord('A')+1)*(con**(l_p-1))))+(ord(string[ind+l_p-1])-ord('A')+1) if(currhash==hashval): print "Found at index",ind
For a sample run, let’s search for {rk_search(“AABAACAADAABAABA”,”AABA”) }
In both cases, the output is as follows,
So that was about the Rabin-Karp Algorithm
Feel free to leave behind any sort of feedback, suggestions, doubts below
Also read:
Leave a Reply