String matching strategies or algorithms provide a key role in various real-world problems or applications. A few of its imperative applications are Spell Checkers, Spam Filters, Intrusion Detection System, Search Engines, Plagiarism Detection, Bioinformatics, Digital Forensics and Information Retrieval Systems etc. It is also used in the Data Schema, Network systems.
String Matching Algorithms can broadly be classified into two types of algorithms :
Exact String Matching Algorithms
Exact string matching algorithms is to find one, several, or all occurrences of a defined string in a large string (text or sequences) such that each matching is perfect. All alphabets of patterns must be matched to the corresponding matched subsequence. These are further classified into four categories:
Approximate String Matching Algorithms
Fuzzy Matching or Approximate String Matching algorithms are the process of finding strings that approximately match a pattern, which is classified into two categories: on-line and off-line. With online algorithms, the pattern can be processed before searching but the text cannot. In other words, on-line techniques do searching without an index.
A fuzzy matching algorithm can operate like a spell checker and spelling-error corrector. For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.
With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application. Approximate matching is also used in spam filtering. Record linkage is a common application where records from two disparate databases are matched.
String matching cannot be used for most binary data, such as images and music. They require different algorithms, such as acoustic fingerprinting.,
Fuzzy Matching Techniques
Here’s a list of the various fuzzy matching techniques that are in use today:
Since fuzzy matching is based on a probabilistic approach to identifying matches, it can offer a wide range of benefits such as:
· Higher matching accuracy
· Provides solutions to complex data
· Easily configurable to effect false positives
· Better suited to finding matches without a consistent unique identifier
· Can incorrectly link different entities
· Difficult to scale across larger datasets
· Can require considerable testing for validation
Fuzzy matching also has the following drawbacks:
Fuzzywuzzy in Python
Fuzzywuzzy is a python library that uses Levenshtein Distance to calculate the differences between sequences and patterns, a service that finds event tickets from all over the internet and showcases them on one platform. The big problem they were facing was the labeling of the same events as stated on their blog.
Python libraries such as FuzzyWuzzy can be used to run string matching in an easy and intuitive method. Using the Python Record Linkage Tookit, users can run several indexing methods including sorted neighborhood and blocking and identify duplicates using FuzzyWuzzy. Although Python is easy to use, it can be slower to run matches than other methods.
from fuzzywuzzy import fuzz
str1 = 'California, USA'
str2 = 'California'
ratio = fuzz.ratio(str1, str2)
partial_ratio = fuzz.partial_ratio(str1, str2)
partial ratio is
100 while the plain
80 so relying on
partial ratio in handy in some scenarios.
The Levenshtein distance between two strings is the number of deletions, insertions and substitutions needed to transform one string into another.
Once you install the
pip install python-Levenshtein
import Levenshtein as levenshtein
str1 = 'Context is worth 80 IQ points.'
str2 = 'A change in perspective is worth 80 IQ points'
edit_distance = levenshtein.distance(str1, str2)
similarity_ratio = levenshtein.ratio(str1, str2)
|Algorithm name||Comparison Order||
Preprocess time complexity
|Search time complexity||Main characteristic|
|Boyer Moore||Right to Left||O(m+n)||O(mn)||Use both good suffix shift and bad character shift|
|BM Horspool||Not relevant||O(m+n)||O(mn)||Use only bad character shift with rightmost character|
|Brute Force||Not relevant||No preprocessing||O(mn)||Use one by one character shift. Not an optimal one|
|KMP||Left to Right||O(m)||O(m+n)||Independent of alphabet size, use the notion of border of the string, increases performance, decrease delay and decrease time of comparing|
|Quick Search||Not relevant||O(m+n)||O(mn)||Use only bad character shift, very fast|
|Rabin Karp||Left to Right||O(m)||O(mn)||Use hashing function, very effective for multiple pattern matching in 1D matching|
|Approximate string matching||Not relevant||-||O(mn)||First matching approximate then searching dictionary|
|Needleman Wunsch||Left to Right||-||O(mn)||Global alignment of proteins and nucleotides|
|Smith Waterman||Left to Right||-||O(mn)||Align local sequence in segments of proteins and nucleotides|
Applications of String Matching Algorithms:
Bioinformatics and DNA Sequencing
DNA and protein sequences have a central role in modern biology. The rapidly growing databases of such sequences present a challenge for developing efficient string matching algorithms. The DNA sequences of related species and even individuals within a species can differ slightly, and thus there is a need for approximate searching of the sequences in addition to exact searching.
In many cases, approximate matching is not sufficient to model the complex biological variation present in real sequences. A weighted pattern is one model that has been successfully applied to model. In bioinformatics, the terms position weight matrices, position-specific scoring matrices, or profiles are often used to refer to weighted patterns. String matching algorithms and DNA analysis are both collectively used for finding the occurrence of the pattern set.
In anti-virus scanning, signatures are defined to describe known computer viruses, and the first task in these applications is to locate these signatures in large amounts of data. When a signature is found, more sophisticated methods are needed to confirm the presence of a computer virus. The rapidly growing set of signatures calls for efficient multiple string matching algorithms. In intrusion detection applications, strings related to attacks are defined. These strings are then searched for in network traffic, and the system is alerted for further inspection if a suspicious sequence of these strings is found.
Plagiarism has become a growing concern in education. In computer science, a particular problem is the copying of code for programming assignments.Thus, these algorithms are used to detect similarities between them and declare if the work is plagiarized or original.
Searching for images is an extension of string matching to two-dimensional objects. Several algorithms have been presented to solve the exact matching problem. We consider the two-dimensional version of parameterized string matching, which can identify an image even if its color map has been changed.
String matching algorithms are used to locate specific text strings of interest in the digital forensic text, which are useful for the investigation.
Trie is built based on a predefined set of patterns. Then, this trie is used for string matching. The text is taken as input, and if any such pattern occurs, it is shown by reaching the acceptance state.
Spam filters use string matching to discard the spam. For example, to categorize an email as spam or not, suspected spam keywords are searched in the content of the email by string matching algorithms.
Text Editor, Digital Library and Search Engine
Every person uses a text editor and every user of a digital library or search engine needs to find patterns in a text. The Boyer Moore algorithm is directly implemented the search command of practically all text editors. The longest common subsequence dynamic programming algorithm is implemented in system commands that test differences between files.
Multimedia and Computational Biology
It has shown that a much more generalized theoretical basis of pattern matching could be of tremendous benefit. Pattern matching has to adapt itself to increasingly broader definitions of matching. In computational biology one may be interested in finding a close mutation, in communications one may want to adjust for transmission noise, in texts it may be desirable to allow common typing errors. In multimedia, one may want to adjust for loss compressions, occlusions, scaling, affine transformations or dimension loss. The largest overlap heuristic for finding the shortest common superstring has been used in DNA sequencing. The algorithms that are used in Needleman Wunsch and Smith Waterman.
The BMH algorithm achieves the best overall results when used with medical tests. This algorithm usually performs at least twice as fast as the other algorithms tested. The time performance of exact string pattern matching can be greatly improved if an efficient algorithm is used. Considering the growing amount of text handled in the electronic patient record, it is worth implementing this efficient algorithm.
String Prefix Matching Problem
This refers to the matching the prefixes of the pattern and the text. It also checks the longest prefix of some given sequence/text. This occurs at the starting of the patterns. It also includes preprocessing of the pattern. KMP algorithm and deterministic sequential comparison model are used to solve this problem. This can be done by assigning the lower and the upper bounds of the prefix to be matched.
Polymorphic String Matching
In this some basic string matching algorithm are combined to make one or more efficient algorithms for further computation as in fusion of the algorithms some functions become easy to define and some of the time consuming parameters need not be used. In this data representation technique used is tree. One example is KMP and Boyer Moore fusion. There combination completes the task in amortized constant time and cost is also equal to equality check cost. The quadratic time is dropped. The features of both the algorithms are combined and used for producing a better functional algorithm.
Network Intrusion Detection System
This problem requires an exact pattern matching technique. This is open source IDS snort. It reduces computational time and higher order of context matching is performed. To solve this Boyer Moore algorithm is used as it needs exact matching of the given pattern. To achieve this tree data structure is used in modified algorithms that convert the bad character shift to good prefix shift and resulting in far better performance.
Denoising of the Pattern
When the text is long it creates noise in the pattern. To remove these denoise technique i.e. DUDE algorithm is used for this. It comprises of simple string matching technique with radix sort for getting a noise free sequence. It has 2 passes: 1st collects the empirical probability distribution and observe the noise pattern within the double sided window and in 2nd pass those noise patterns are replaced by denoise pattern/sequence.
Approximate Similarity Measure
This algorithm is used in real-world applications. This is for finite-length strings. It searches the optimal alignment for pattern searching. It is far-far better than any other algorithms. In the previous similarity measure algorithm the variations in the pattern decreases but now in the new proposed algorithm the numbers of the comparisons are reduced. This is done by picking only those strings from the database whose lengths are either equal to or greater than the pattern to be matched.
Retrieving Music Pattern from Musical Database
When musical note from musical database are to be retrieved then we need string matching. There are four similar techniques for this: edit distance, dice similarity, jaccard similarity and cosine similarity. The musical notes are retrieved by QBE (query by example) approach. So the best scheme for this problem is Levenshtein distance with jaccard similarity. This is an approximate music search technique. As the jaccard similarity performs excellent in passing a query when a pitch change scenario is selected.
String Matching through Two Dimensional Meshes
This is the optimal time algorithm that is used for computational biology, cellular automata. It reduces time spent on the comparisons in string matching by using mesh or 2D matrix. Data partition technique is used here. The process is as find the occurrences of pattern, pattern preprocessing, text searching, decomposing the pattern given and finally creating sparse table. The maximum time will be less than half of the size of the string to be matched. This is a optimal algorithm for the mesh network structure.