Explain the Aho-Corasick Algorithm for Pattern Searching in Python
Aho-Corasick algorithm, a type of Dictionary-matching algorithm. This algorithm can be helpful to find word or words appearing from a set of keywords or data we feed. Aho-Corasick algorithm is a fast and efficient way of finding word and location. As because we want things to be found easily and efficiently so it finds the easy way of finding through the dictionary we provide.
Introduction: More about Aho-Corasick Algorithm
Aho-Corasick is a string searching algorithm and was invented by Alfred V. Aho and Margaret J. Corasick.
Aho-Corasick algorithm makes a state machine and also uses a TRIE concept.
The algorithm is implemented by using a tree data structure. Once we make the tree, what it does is, it coverts or try to convert tree in the form of automaton which helps to take linear time to complete or make the search.
Aho-Corasick Algorithm is subdivided into three major phases:
- Go(go to the main function and set points and the main root)
- Output(The results or outcome after the condition matches and availability matches)
- Failure (If the keyword does not match then it failed to count it)
Go stage: It forms the tree with the help of provided keywords, the data which we feed in as recognisable pattern/design.
Failure stage: It seeks to look for the backward transformation to get appropriate appendix of keywords.
Output stage: It searches all word which ends at a particular state. Suppose it ends at the state ‘a’ for each state ‘a’ of the automaton.
Time complexity
If we talk about the time complexity of this algorithm.
Let L is the length of entered Text, let B is the length of keywords (dataset), let X be the number of possible matches or matches made.
Then, the time complexity of Algorithm is: O(L+B+X)
Input and Output
How it actually works:
Input:
Sample set : {‘you’, ‘are’, ‘good’}
Searched string: “youarenotkindbutgoodboy”
Output:
The word ‘you’ found at position :0
The word ‘are’ found ate position :3
The word ‘good’ found at position: 16
In Python, it works with the help of a Trie. We can also learn it from the technique of a Trie.
Code (Python)
The CODE is mainly divided into four sections:
- We create the tree (defining function)
- We create the state transition (defining function)
- We now find the string given as input (defining function)
- The main section where we give the design (patterns) that is the data set we provide the system and the input string to search and then we call the function.
Below is our Python program for Aho-Corasick Algorithm for Pattern Searching:
print "Hello, World!"
class Ahomain:    #We create class for Aho-Corasick
    def __init__(self):   #constructor with its 
        self.go = {}      
        self.out = []
        self.breaks = None
 
def aho_treeform(list1):  #creating the tree
    main = Ahomain()      # Object of Aho class   
 
    for way in list1:
        point = main
        for sym in way:
            point = point.go.setdefault(sym,
Ahomain())
        point.out.append(way)
    return main
def aho_state_transition(list1):  #state transition
    main = aho_treeform(list1)    # machine 
    queue = []
    for point in main.go.itervalues():
        queue.append(point)
        point.breaks = main
 
    while len(queue) > 0:
        rightpoint = queue.pop(0)
 
        for clue,uniquepoint in rightpoint.go.iteritems():
            queue.append(uniquepoint)
            firstpoint = rightpoint.breaks
            while firstpoint != None and not firstpoint.go.has_key(clue):
                firstpoint = firstpoint.breaks
            uniquepoint.breaks = firstpoint.go[clue] if firstpoint else main
            uniquepoint.out += uniquepoint.breaks.out
 
    return main
 
 
def aho_search(y, main, call):  #searching the input
    point = main
 
    for i in xrange(len(y)):
        while point != None and not point.go.has_key(y[i]):
            point = point.breaks
        if point == None:
            point = main
            continue
        point = point.go[y[i]]
        for design in point.out:
            call(i - len(design) + 1, design)
def found(loc, list1):    #printing the results
    print "The Design found at position %s, found pattern: %s" % (loc, list1)
 
list1 = ['a', 'ab', 'aa', 'abc', 'bc', 'bca', 'cc', 'c', 'cba', 'cab']
y = "abcbaacab"
main = aho_state_transition(list1)
aho_search(y, main, found)Output :
Hello, World! The design found at position 0, found pattern: a The design found at position 0, found pattern: ab The Design found at position 0, found pattern: abc The design found at position 1, found pattern: bc The Design found at position 2, found pattern: c The Design found at position 2, found pattern: cba The Design found at position 4, found pattern: a The Design found at position 4, found pattern: aa The Design found at position 5, found pattern: a The Design found at position 6, found pattern: c The Design found at position 7, found pattern: a The Design found at position 6, found pattern: cab The Design found at position 7, found pattern: ab
About the code:
We can also use the input function after we have written the found function. We can take input for the list1 by the help of loop, enter the data and it will append in the list.
We can also use input for the ‘y’ from the user.
y=input (“Enter the string to search”). In this case, we used a, b, c. But we can add more to the list and search it from the input.
If this code does not run on your Python IDLE or computer.
Just make some changes.
1. Instead of using d.itervalues and d. iteritems , we can also write iter(d.values()) and iter(d.items()) . This happens because some version nay does not support some function of Python.
2. Instead of using .has_key we can use ‘ in ‘ . Example: go.has_key(clue) can be written as go in(clue).
3. We can use a range instead of writing range.
4. You may face the error of indentation or outer scope than you should declare list 1 and main as global.
And rest we can change variable names, no more changes required
If you want to know more visit the Python documentation site, If you face version compatibility errors. https://docs.python.org/3.1/whatsnew/3.0.html
I hope I could serve your request and made you understood. Will appreciate your feedback.
as much as this is helpful , I wish you commented the code for more details 🙁
Hello, I didn’t understand your query. Is there any error in the code? Should we make some changes?
im trying to use aho corasick algorithms to extract some metadata from a text database, please help me with it, m lost