Aho–Corasick algorithm

In computer science, the Aho–Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all strings simultaneously. The complexity of the algorithm is linear in the total length of the dictionary strings plus the length of the searched text plus the number of output matches. Because all matches are reported, the number of matches can be quadratic in the length of the text if every substring matches.
Informally, the algorithm constructs a finite-state machine that resembles a trie with additional links between the various internal nodes. These extra internal links allow fast transitions from a failed string match to another branch of the trie whose path is a suffix of the text matched so far. This allows the automaton to move between string matches without backtracking in the input.
When the string dictionary is known in advance, the construction of the automaton can be performed once off-line and the compiled automaton stored for later use. In this case, its run time is linear in the length of the input plus the number of matched entries.
The Aho–Corasick string-matching algorithm formed the basis of the original Unix command fgrep.

Example

In this example, we will consider a dictionary consisting of the following words: {a, ab, bab, bc, bca, c, caa}.
The table below describes the Aho–Corasick data structure constructed from the specified dictionary, with each row representing a node in the trie and the Path column indicating the sequence of characters from the root to the node.
The data structure has one node for every prefix of every string in the dictionary. So if (bca) is in the dictionary, then there will be nodes for (bca), (bc), (b), and (). If a node is in the dictionary then it is a blue node. Otherwise it is a grey node.
There is a black directed "child" arc from each node to a node whose name is found by appending one character. So there is a black arc from (bc) to (bca).
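The node and child-arc structure described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `build_trie`, the integer node ids, and the word list passed to it are assumptions made for this example.

```python
def build_trie(words):
    """Return (path, children, in_dict) for a trie over `words`.

    Nodes are integer ids; node 0 is the root (the empty prefix).
    The names and layout here are illustrative assumptions.
    """
    path = [""]          # path[node] -> characters from the root to the node
    children = [{}]      # children[node][char] -> child node id ("child" arc)
    in_dict = [False]    # True if the node's path is a dictionary word
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                # First time this prefix is seen: create its node.
                path.append(path[node] + ch)
                children.append({})
                in_dict.append(False)
                children[node][ch] = len(path) - 1
            node = children[node][ch]
        in_dict[node] = True
    return path, children, in_dict


path, children, in_dict = build_trie(["a", "ab", "bab", "bc", "bca", "c", "caa"])
print(len(path))                       # one node per distinct prefix, root included
print(sorted(p for p, d in zip(path, in_dict) if d))  # paths that are dictionary words
```

Each distinct prefix gets exactly one node, so words that share a prefix (such as bc and bca) share the nodes along that prefix.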
There is a blue directed "suffix" arc from each node to the node that is the longest possible strict suffix of it in the graph. For example, for node (caa), its strict suffixes are (aa) and (a) and (). The longest of these that exists in the graph is (a). So there is a blue arc from (caa) to (a). The blue arcs can be computed in linear time by performing a breadth-first search starting from the root. The target for the blue arc of a visited node can be found by following its parent's blue arc to its longest suffix node and searching for a child of the suffix node whose character matches that of the visited node. If no such child exists, the search continues from that suffix node's own blue arc, and so on, until a matching child is found or the root is reached.
There is a green "dictionary suffix" arc from each node to the next node in the dictionary that can be reached by following blue arcs. For example, there is a green arc from (bca) to (a) because (a) is the first node in the dictionary (that is, the first blue node) that is reached when following the blue arcs to (ca) and then on to (a). The green arcs can be computed in linear time by repeatedly traversing blue arcs until a blue node is found, and memoizing this information.
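The breadth-first computation of the suffix (blue) and dictionary suffix (green) arcs can be sketched as follows. This is one possible arrangement under stated assumptions: the function name `build_automaton`, the parallel-list layout, and the use of `None` for "no arc" are choices made for this example, and the trie is rebuilt inline so that the block is self-contained.

```python
from collections import deque


def build_automaton(words):
    """Build child arcs, suffix links and dictionary suffix links.

    Returns parallel lists indexed by node id (0 is the root).
    """
    path, children, in_dict = [""], [{}], [False]
    for word in words:
        node = 0
        for ch in word:
            if ch not in children[node]:
                path.append(path[node] + ch)
                children.append({})
                in_dict.append(False)
                children[node][ch] = len(path) - 1
            node = children[node][ch]
        in_dict[node] = True

    suffix = [None] * len(path)        # "blue" arcs; the root has none
    dict_suffix = [None] * len(path)   # "green" arcs
    queue = deque()
    for child in children[0].values():
        suffix[child] = 0              # depth-1 nodes fall back to the root
        queue.append(child)
    while queue:                       # breadth-first: shallower nodes first
        node = queue.popleft()
        for ch, child in children[node].items():
            # Walk the parent's suffix chain until a node has a `ch` child.
            fallback = suffix[node]
            while fallback is not None and ch not in children[fallback]:
                fallback = suffix[fallback]
            suffix[child] = 0 if fallback is None else children[fallback][ch]
            queue.append(child)
        # The suffix target is shallower, so its green arc is already final;
        # reusing it is the memoization step.
        target = suffix[node]
        dict_suffix[node] = target if in_dict[target] else dict_suffix[target]
    return path, children, in_dict, suffix, dict_suffix


path, children, in_dict, suffix, dict_suffix = build_automaton(
    ["a", "ab", "bab", "bc", "bca", "c", "caa"])


def name(node):
    """Path string of a node id, or None when there is no arc."""
    return None if node is None else path[node]


links = {path[n]: (name(suffix[n]), name(dict_suffix[n])) for n in range(len(path))}
print(links["caa"])  # ('a', 'a')
print(links["bca"])  # ('ca', 'a')
```

Because the queue visits nodes in order of depth, every suffix target has been fully processed before the node that points to it, which is what lets each green arc be filled in with a single lookup.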

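Path	In dictionary	Suffix link	Dict suffix link
()	-		
(a)	+	()	
(ab)	+	(b)	
(b)	-	()	
(ba)	-	(a)	(a)
(bab)	+	(ab)	(ab)
(bc)	+	(c)	(c)
(bca)	+	(ca)	(a)
(c)	+	()	
(ca)	-	(a)	(a)
(caa)	+	(a)	(a)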

At each step, the algorithm tries to extend the current node by following the child arc for the next input character. If no such child exists, it tries the child of the node's suffix, then the child of that suffix's suffix, and so on, finally ending at the root node if no matching child is found.
When the algorithm reaches a node, it outputs all the dictionary entries that end at the current character position in the input text. This is done by printing every node reached by following the dictionary suffix links, starting from that node, and continuing until it reaches a node with no dictionary suffix link. In addition, the node itself is printed, if it is a dictionary entry.
Execution on input string abccab yields the following steps:

Node	Remaining string	Output:end position	Transition	Output
()	abccab		start at root	
(a)	bccab	a:1	() to child (a)	Current node
(ab)	ccab	ab:2	(a) to child (ab)	Current node
(bc)	cab	bc:3, c:3	(ab) to suffix (b) to child (bc)	Current node, Dict suffix node
(c)	ab	c:4	(bc) to suffix (c) to suffix () to child (c)	Current node
(ca)	b	a:5	(c) to child (ca)	Dict suffix node
(ab)		ab:6	(ca) to suffix (a) to child (ab)	Current node
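The matching procedure described above can be sketched in Python. To keep the example short, the links are computed here by brute force straight from their definitions (longest strict suffix that is a node), not by the linear-time breadth-first search, so only the search loop reflects the algorithm's actual running time. The names `build_links` and `find_all` are hypothetical, nodes are identified by their path strings, and the dictionary words are assumed to be non-empty.

```python
def build_links(words):
    """Compute suffix and dictionary suffix links from their definitions.

    Nodes are named by their path string; "" is the root. This is a
    brute-force construction for illustration, not the linear-time one.
    """
    words = set(words)
    nodes = {w[:i] for w in words for i in range(len(w) + 1)}
    suffix, dict_suffix = {"": None}, {"": None}
    for node in sorted(nodes - {""}, key=len):    # shorter paths first
        # Longest strict suffix of `node` that is itself a node
        # (the empty string always qualifies, so this never fails).
        suffix[node] = next(node[i:] for i in range(1, len(node) + 1)
                            if node[i:] in nodes)
        target = suffix[node]
        dict_suffix[node] = target if target in words else dict_suffix[target]
    return nodes, words, suffix, dict_suffix


def find_all(words, text):
    """Return (word, end_position) pairs; end positions count from 1."""
    nodes, words, suffix, dict_suffix = build_links(words)
    matches, node = [], ""
    for pos, ch in enumerate(text, start=1):
        # Follow suffix links until some node has a child for `ch`.
        while node is not None and node + ch not in nodes:
            node = suffix[node]
        node = "" if node is None else node + ch
        # Report the node itself if it is a word, then its green-arc chain.
        out = node if node in words else dict_suffix[node]
        while out is not None:
            matches.append((out, pos))
            out = dict_suffix[out]
    return matches


result = find_all(["a", "ab", "bab", "bc", "bca", "c", "caa"], "abccab")
print(result)
# [('a', 1), ('ab', 2), ('bc', 3), ('c', 3), ('c', 4), ('a', 5), ('ab', 6)]
```

The input position only ever moves forward; a failed extension is handled by moving the current node along suffix links, never by re-reading earlier characters.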

Changeable search list

The original Aho–Corasick algorithm assumes that the set of search strings is fixed. It does not directly apply to applications in which new search strings are added while the algorithm is running. An example is an interactive indexing program, in which the user goes through the text and highlights new words or phrases to index as they see them. Bertrand Meyer introduced an incremental version of the algorithm in which the search string set can be incrementally extended during the search, retaining the algorithmic complexity of the original.
