Aho-Corasick Algorithm for Pattern Searching in C++

Post Views: 875

Hello everyone, today we are going to perform the search for a pattern in a string using Aho-Corasick Algorithm in C++.

So, why is this algorithm better than other available ways to search for a pattern? That’s because this algorithm provides a much better time complexity as opposed to others. The Aho-Corasick Algorithm can operate at the complexity of :

O(length of original string+no.of words to search+word occurrence frequency).

Thus, compared to other algorithms, it provides a much better time complexity, and thus speeds up the search.

Aho-Corasick Algorithm for Pattern Searching in C++

Let us begin the code for the algorithm now:

As usual, begin by declaring the header files and namespace to define the scope. We may also pre-define some constants for easier readability.

#include <bits/stdc++.h> 
using namespace std; 

const int S =500; 
const int C =26; 
int op[S],int nf[S], int gt[S];

The constant declaration S is used to specify the length total of the letters that we wish to check. C stands for the no. of English alphabet letters. We also declare two 1-D arrays and one 2D array, where ‘op’ is for the output, ‘nf’ is for the failure and ‘gt’ is to go-to function.

Next, we define a function ‘matcher’, which makes the matching mechanism and then returns the no. of states.

First we declare the function and then review what it means.

int matcher(string letter[], int n) 
{ 
  memset(op, 0, sizeof op); 
  memset(gt, -1, sizeof gt); 
  int states = 1; 

  for (int i = 0; i < n; ++i) 
  { const string &word = letter[i]; 
    int cs = 0; 
   for (int j = 0; j < word.size(); ++j) 
    { 
      int ch = word[j] - 'a'; 
 
      if (gt[cs][ch] == -1) 
        gt[cs][ch] = states++; 

      cs = gt[cs][ch]; 
    } 
    op[cs] = (1 << i); 
  } 


  for (int ch = 0; ch < C; ++ch) 
    if (gt[0][ch] == -1) 
      gt[0][ch] = 0; 


  memset(nf, -1, sizeof nf); 


  queue<int> q; 

  for (int ch = 0; ch < C; ++ch) 
  { 
    if (gt[0][ch] != 0) 
    { 
      nf[gt[0][ch]] = 0; 
      q.push(gt[0][ch]); 
    } 
  } 

  while (q.size()) 
  { 
    int state = q.front(); 
    q.pop(); 

    for (int ch = 0; ch <= C; ++ch) 
    { 
      if (gt[state][ch] != -1) 
      { 
        int fail = nf[state]; 

        while (gt[fail][ch] == -1) 
          fail = nf[fail]; 

        fail = gt[fail][ch]; 
        nf[gt[state][ch]] = fail; 
        op[gt[state][ch]] |= op[fail]; 

        q.push(gt[state][ch]); 
      } 
    } 
  } 

  return states; 
}

The ‘memeset’ is an inbuilt function which assigns a memory block with a desired value. Thus, we preset all output values with 0 and go-to values with -1.

We then fill the go-to function, take the characters and insert it into the input and assign new nodes to the ‘ch’ variable. Once all done, we initialize the not found function ‘nf ‘ with -1 and use a queue data structure to compute using Breadth-First Search. All nodes of depth 1 is found to possess nf=0. We remove the front state, find nf, and proceed further to finally return the states.

We next define a function to take the current state and a character input, and return the state the code goes into.

int findnew(int cs, char ip) 
{       int ans = cs; 
  int ch = ip - 'a'; 
        while (gt[ans][ch] == -1) 
  ans = nf[ans]; 
  return gt[ans][ch]; 
}

Now comes the most important function. This checks if the required substring is present or not.

void search(string letter[], int k, string input) 
{ 
  matcher(letter, k); 
  int cs = 0; 
  for (int i = 0; i < input.size(); ++i) 
  { 
    cs = findnew(cs, input[i]); 
    if (op[cs] == 0) 
      continue;  
    for (int index = 0; index < k; ++index) 
    { 
      if (op[cs] & (1 << index)) 
      { cout << "The required word you want - " << letter[index] << " -starts from "<< i - letter[index].size() + 1 << " until " << i << endl; 
      } } } }

The matcher is the function we had defined above. We initialize a variable cs=0, and use a loop to scan through the main string to find the occurrences. If the substring is present, we output using a loop, else we move forward to the next character.

Once all done, we declare the main() to call the functions and run the code.

int main() 
{ 
  string letter[] = {"co", "speed", "code", "abcd"}; 
  string input = "codespeedy"; 
  int n = sizeof(letter)/sizeof(letter[0]); 

  search(letter, n, input); 

  return 0; 
}

We enter the main string ‘codespeedy’ and try to search for the words co/speed/ code / abcd.

OUTPUT

The required word you want - co -starts from 0 until 1
The required word you want - code -starts from 0 until 3
The required word you want - speed -starts from 4 until 8

Now, on checking the output, we find that the words co/speed/code to be available. It says starts from 0, as an array/string starts from index 0. Since ‘abcd’ is not present, we get no output.

Thus, by using the Aho-Corasick Algorithm, we found matching patterns. I hope the tutorial is of your help.

Also read: Unbounded fractional knapsack problem in C++

One response to “Aho-Corasick Algorithm for Pattern Searching in C++”

Aditya says:

April 30, 2020 at 10:51 am

Quite a simple explanation. Very easy to understand.

Reply

Aho-Corasick Algorithm for Pattern Searching in C++

Aho-Corasick Algorithm for Pattern Searching in C++

One response to “Aho-Corasick Algorithm for Pattern Searching in C++”

Leave a Reply Cancel reply

Related Posts