Aho-Corasick Algorithm for Pattern Searching in C++
Hello everyone, today we are going to perform the search for a pattern in a string using Aho-Corasick Algorithm in C++.
So, why is this algorithm better than other available ways to search for a pattern? That’s because this algorithm provides a much better time complexity as opposed to others. The Aho-Corasick Algorithm can operate at the complexity of :
O(length of original string+no.of words to search+word occurrence frequency).
Thus, compared to other algorithms, it provides a much better time complexity, and thus speeds up the search.
Aho-Corasick Algorithm for Pattern Searching in C++
Let us begin the code for the algorithm now:
As usual, begin by declaring the header files and namespace to define the scope. We may also pre-define some constants for easier readability.
#include <bits/stdc++.h> using namespace std; const int S =500; const int C =26; int op[S],int nf[S], int gt[S];
The constant declaration S is used to specify the length total of the letters that we wish to check. C stands for the no. of English alphabet letters. We also declare two 1-D arrays and one 2D array, where ‘op’ is for the output, ‘nf’ is for the failure and ‘gt’ is to go-to function.
Next, we define a function ‘matcher’, which makes the matching mechanism and then returns the no. of states.
First we declare the function and then review what it means.
int matcher(string letter[], int n) { memset(op, 0, sizeof op); memset(gt, -1, sizeof gt); int states = 1; for (int i = 0; i < n; ++i) { const string &word = letter[i]; int cs = 0; for (int j = 0; j < word.size(); ++j) { int ch = word[j] - 'a'; if (gt[cs][ch] == -1) gt[cs][ch] = states++; cs = gt[cs][ch]; } op[cs] = (1 << i); } for (int ch = 0; ch < C; ++ch) if (gt[0][ch] == -1) gt[0][ch] = 0; memset(nf, -1, sizeof nf); queue<int> q; for (int ch = 0; ch < C; ++ch) { if (gt[0][ch] != 0) { nf[gt[0][ch]] = 0; q.push(gt[0][ch]); } } while (q.size()) { int state = q.front(); q.pop(); for (int ch = 0; ch <= C; ++ch) { if (gt[state][ch] != -1) { int fail = nf[state]; while (gt[fail][ch] == -1) fail = nf[fail]; fail = gt[fail][ch]; nf[gt[state][ch]] = fail; op[gt[state][ch]] |= op[fail]; q.push(gt[state][ch]); } } } return states; }
The ‘memeset’ is an inbuilt function which assigns a memory block with a desired value. Thus, we preset all output values with 0 and go-to values with -1.
We then fill the go-to function, take the characters and insert it into the input and assign new nodes to the ‘ch’ variable. Once all done, we initialize the not found function ‘nf ‘ with -1 and use a queue data structure to compute using Breadth-First Search. All nodes of depth 1 is found to possess nf=0. We remove the front state, find nf, and proceed further to finally return the states.
We next define a function to take the current state and a character input, and return the state the code goes into.
int findnew(int cs, char ip) { int ans = cs; int ch = ip - 'a'; while (gt[ans][ch] == -1) ans = nf[ans]; return gt[ans][ch]; }
Now comes the most important function. This checks if the required substring is present or not.
void search(string letter[], int k, string input) { matcher(letter, k); int cs = 0; for (int i = 0; i < input.size(); ++i) { cs = findnew(cs, input[i]); if (op[cs] == 0) continue; for (int index = 0; index < k; ++index) { if (op[cs] & (1 << index)) { cout << "The required word you want - " << letter[index] << " -starts from "<< i - letter[index].size() + 1 << " until " << i << endl; } } } }
The matcher is the function we had defined above. We initialize a variable cs=0, and use a loop to scan through the main string to find the occurrences. If the substring is present, we output using a loop, else we move forward to the next character.
Once all done, we declare the main() to call the functions and run the code.
int main() { string letter[] = {"co", "speed", "code", "abcd"}; string input = "codespeedy"; int n = sizeof(letter)/sizeof(letter[0]); search(letter, n, input); return 0; }
We enter the main string ‘codespeedy’ and try to search for the words co/speed/ code / abcd.
OUTPUT
The required word you want - co -starts from 0 until 1 The required word you want - code -starts from 0 until 3 The required word you want - speed -starts from 4 until 8
Now, on checking the output, we find that the words co/speed/code to be available. It says starts from 0, as an array/string starts from index 0. Since ‘abcd’ is not present, we get no output.
Thus, by using the Aho-Corasick Algorithm, we found matching patterns. I hope the tutorial is of your help.
Also read: Unbounded fractional knapsack problem in C++
Quite a simple explanation. Very easy to understand.