5 months, 1 week

Pattern Searching Algorithms

Pattern Searching algorithms are also known as String Searching Algorithm, which is  used to find a pattern or substring from another bigger string. There are different algorithms. The main goal to design these type of algorithms to reduce the time complexity.

Naive algorithm 
It checks for all character of the main string to the pattern. This algorithm is helpful for smaller texts. It does not need any pre-processing phases.
The best case occurs when the first character of the pattern is not present in the text at all. The number of comparisons in the best case is O(n). 
The number of comparisons in the worst case is O(m*(n-m+1)).( When all characters of the text and pattern are the same or  when only the last character is different.)


Aho-Corasick Algorithm

It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings within an input text. It matches all strings simultaneously.
Aho-Corasick Algorithm finds all words in O(n + l + z), where ‘n’ is the length of the text, ‘l’ is the length of keywords, and ‘z’ is the number of matches.O(l * q), where ‘q’ is the length of the alphabet since that is the maximum number of children a node can have.

The Aho–Corasick string matching algorithm formed the basis of the original Unix command fgrep. 

There are three different phases of Aho-Corasick Algorithm. These are Go-to, Failure, and Output.

Go To :   This function simply follows edges of Trie of all words in arr[]. It is represented as 2D array g[][] where we store next state for current state   and character.

Failure : This function stores all edges that are followed when current character doesn't have edge in Trie. It is represented as 1D array f[] where we  store next state for current state. 

Output :  Stores indexes of all words that end at current state. It is represented as 1D array o[] where we store indexes of all matching words as a bitmap  for current state.

Matching : Traverse the given text over built automaton to find all matching words.

● To construct the Aho-Corasick automaton, we need to
● construct the trie,
● construct suffix links, 
● construct output links.

"pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find multiple key strings occurrences at once in some input text. The strings “index” can be built ahead of time and saved (as a pickle) to disk to re re-sed later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search."

Anagram Substring Search
Anagrams are basically all permutations of a given string or pattern. This pattern searching algorithm is slightly different. In this case, not only the exact pattern is searched, it searches all possible arrangements of the given pattern in the text.The time Complexity of Anagram Pattern Search Algorithm is O(n). 


Boyer Moore Algorithm 
Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The Boyer–Moore algorithm uses information gathered during the preprocess step to skip sections of the text, resulting in a lower constant factor than many other string search algorithms. In general, the algorithm runs faster as the pattern length increases. The key features of the algorithm are to match on the tail of the pattern rather than the head, and to skip along the text in jumps of multiple characters rather than searching every single character in the text. This algorithm does preprocessing, but not the whole text.

Why is it good? We can skip multiple characters at the same time rather than searching every single character in the text.

The two strategies are called heuristics of Boyer–Moore as they are used to reduce the search. These two shift functions are called the good-suffix heuristic (also called matching shift) and the bad-character heuristic (also called the occurrence shift).

Both of the above heuristics can also be used independently to search a pattern in a text. The Boyer Moore algorithm does preprocessing  so that the pattern can be shifted by more than one. It processes the pattern and creates different arrays for each of the two heuristics. At every step, it slides the pattern by the max of the slides suggested by each of the two heuristics. So it uses greatest offset suggested by the two heuristics at every step. The Boyer Moore algorithm starts matching from the last character of the pattern.

Bad-Character Heuristic
The character of the text which does not match with the current character of the pattern is called the bad character. 
- Make a table of the characters
- Make sure the table does not contain repetitive characters. If there are several letters in the pattern, the bad table only contains one letter.
- max(1, lengthOfPattern - actualIndex -1)
- We iterate over the pattern and compute the values to the ad match table. 
- We keep updating the old values for the same character.

Good Suffix Heuristic
If t is the longest suffix of P that matches T in the current position, then P can be shifted so that the previous occurrence of t in P matches T. In fact, it can be required that the character before the previous occurrence of t be different from the character before the occurrence of t as a suffix. If no such previous occurrence of t exists, then the following cases apply:

Find the smallest shift that matches a prefix of the pattern to a suffix of t in the text
If there's no such match, shift the pattern by n (the length of P)

Precompute the shift amounts in two tables
– bad‐symbol table indicates how much to shift based on the text’s character that causes a mismatch
– good‐suffix table indicates how much to shift based on matched part (suffix) of the pattern

The preprocessing phase in O(m+)time and space complexity.The complexity of the searching phase is O(mn).Its best performance complexity is O(n / m).

Boyer-Moore is faster, simpler and optimized the searches of substrings. It has the following uses,
String Analyzers                     
Big Data                                 
Text labeling                          



Quick Search Algorithm

  • simplification of the Boyer-Moore algorithm;
  • uses only the bad-character shift;
  • easy to implement;
  • preprocessing phase in O(m+sigma) time and O(sigma) space complexity;
  • searching phase in O(mn) time complexity;
  • very fast in practice for short patterns and large alphabets.
  • this algorithm starts searching from the left-most character to the right



Finite Automata
A finite automaton(FA) has a finite set of states with which it accepts or rejects strings, which is used in text processing, compilers, and hardware design. Context-free grammar (CFGs) are used in programming languages and artificial intelligence. The Finite automata or finite state machine is used to parse formal languages. This means that finite automata are very usefull in the creation of compiler and interpreter techniques. There are two types of finite automata: DFA(deterministic finite automata) NFA(non-deterministic finite automata).

A Finite Automata consists of the following:                                      

Q : Finite set of states.                                                                      
Σ : set of Input Symbols.                                                                   
q : Initial state.                                                                                  
F : set of Final States.                                                                       
δ : Transition Function.                                                                       

Formal specification of machine is :{ Q, Σ, q, F, δ }                              


1. DFA
DFA, defined as an abstract mathematical concept, refers to deterministic finite automata. Deterministic refers to the uniqueness of the computation, which is often implemented in hardware and software for solving various specific problems  such as lexical analysis and pattern matching. For example, a DFA can model software that decides whether or not online user input such as email addresses are syntactically valid. In the DFA, the machine goes to one state only for a particular input character. DFA does not accept the null move. It has limited memory.
One important thing to note is, there can be many possible DFAs for a pattern. A DFA with a minimum number of states is generally preferred. 

2. NFA
NFA stands for non-deterministic finite automata. It is used to transmit any number of states for a particular input. It can accept the null move.
Nondeterministic computation = Parallel computation
(NFA searches all possible paths in a graph to the accept state)

Some important points about DFA and NFA:
When NFA has multiple choices for the same input symbol, think of it as a process forking multiple processes for parallel computation.
A string is accepted if any of the parallel processes accepts the string.

Nondeterministic computation = Tree of possibilities
(NFA magically guesses a right path to the accept state)

Root of the tree is the start of the computation. Every branching point is the decision-making point consisting of multiple choices.
Machine accepts a string if any of the paths from the root of the tree to a leaf leads to an accept state.
Constructing NFA’s is easier than directly constructing DFA’s for many problems. Hence, construct NFA’s and then convert them to DFA’s.

NFA’s are easier to understand than DFA’s.
Every DFA is NFA, but NFA is not DFA.
There can be multiple final states in both NFA and DFA.
DFA is used in Lexical Analysis in Compiler.
NFA is more of a theoretical concept.

Efficient Construction of Finite Automata
First, we have to fill a 2D array to make the transition table of the finite automata. Once the table is created, the searching procedure is simple. By starting from the first state of the automaton, when we reach the final state, it means that the pattern is found in the string.
For finite automata construction, the time complexity is O(M*K), M is the pattern length and the K is a number of different characters. The complexity of main pattern searching is O(n).