by

In every programmer’s life comes a time where they need to know if words from a defined set of words appear in input texts.

Today, we’ll see how to do it using a naïve solution, and efficiently using Aho-Corasick algorithm.

Real world usage

I recently used Aho-Corasick algorithm myself.

I needed to search for a defined set of abbreviated phrases appearing in input strings from third party providers and transform them in order to produce human friendly text.

I stored pairs of <abbreviation, replacement> in a DB table. On system start up and every midnight, the code would rebuild the trie and automaton of the abbreviations (described below), to allow the update of the dictionary without interrupting the program.

Every input string was matched against the trie. Then, every matched abbreviation that was found by the algorithm was replaced with its human friendly phrase, turning a cryptic text to a lovely one.

As a side note, know that there are many libraries that implement this code, so you don’t have to write it yourself. You can find one of those in the resources section.

So let’s define the mission

Given a finite set of words (a “dictionary”), find the words that appear in an input text.

Here’s an example:
Say we have the following set of words: {he, she, hers, his}
And an input text: ahishers

We expect the output to be:
-Word his appears from 1 to 3
-Word he appears from 4 to 5
-Word she appears from 3 to 5
-Word hers appears from 4 to 7

How do we do it?

The naïve solution
The naïve solution would be iterating over the dictionary, and checking each word in the dictionary against the input text.

For a dictionary of 4 words we iterate over the input text 4 times, once for each word in the dictionary.

That means that the complexity of the solution is O(m*n + k), where m is the number of words in the dictionary, n is the length of the input text and k is the total number of characters of all the words in the dictionary.

The Aho-Corasick way
Aho–Corasick algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text.

The really cool bit is that it matches all strings simultaneously. How? By building an automaton. The first step of the algorithm is to build a trie (a search tree) that represents the dictionary. The second step is to add edges that turn this simple trie into a linear time matching automaton. These edges allow fast transitions between failed string matches to other branches of the trie that share a common prefix, enabling the automaton to transition between string matches without the need for backtracking.

That sounds kind of complicated, right? So let’s break it down.
We said the first step is to build a trie - a search tree – to represent the dictionary.

As you remember, our dictionary is {he, she, hers, his}.

So our first step will produce a simple trie:

Step 1 - Aho Corasick trie

The next step is to add the edges that turn this simple trie into a linear time matching automaton:

Step 2 - Aho Corasick automaton

All that is left now is the matching - traverse the given text over the built automaton to find all matching words.

When the dictionary of strings is known in advance (like the abbreviations from third party providers), the construction of the automaton can be performed once and stored for consecutive uses in order to save time.

The complexity of the algorithm is O(n + k + z), where n is the length of the input text, k is the total number of characters of all the words in the dictionary and z is total number of occurrences of words in text.

Note – we now achieved the same goal in linear time!

Conclusion

In this post we tackled a mission – how to find matches of a finite number of strings in input strings.

We looked at a real life use case and examined two solutions.

The naïve solution had quadratic complexity.

However, using Aho-Corasick algorithm, we managed to get the same results in linear time, making everybody happy :)

Resources, demoes and further reading:

-http://blog.ivank.net/aho-corasick-algorithm-in-as3.html
-http://www.geeksforgeeks.org/aho-corasick-algorithm-pattern-searching/
-http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf
-https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
-Special bonus: https://github.com/robert-bor/aho-corasick