5 months, 1 week

# Applications of String Matching Algorithms

String matching strategies or algorithms provide a key role in various real-world problems or applications. A few of its imperative applications are Spell Checkers, Spam Filters, Intrusion Detection System, Search Engines, Plagiarism Detection, Bioinformatics, Digital Forensics and Information Retrieval Systems etc. It is also used in the Data Schema,  Network systems.

String Matching Algorithms can broadly be classified into two types of algorithms :

1. Exact String Matching Algorithms
2. Approximate String Matching Algorithms

Exact String Matching Algorithms

Exact string matching algorithms is to find one, several, or all occurrences of a defined string in a large string (text or sequences) such that each matching is perfect. All alphabets of patterns must be matched to the corresponding matched subsequence. These are further classified into four categories:

1. Algorithms based on character comparison:
• Naive Algorithm
• Knutt Morris Pratt Algorithm (KMP)
• Boyer–Moore Algorithm
• Using the Trie data structure
2. Deterministic Finite Automaton (DFA) method:
• Automaton Matcher Algorithm
3. Algorithms based on Bit (parallelism method):
• Aho-Corasick Algorithm
4. Hashing-string matching algorithms:
• Rabin Karp Algorithm

Approximate String Matching Algorithms

Fuzzy Matching or Approximate String Matching algorithms are the process of finding strings that approximately match a pattern, which is classified into two categories: on-line and off-line. With online algorithms, the pattern can be processed before searching but the text cannot. In other words, on-line techniques do searching without an index.

A fuzzy matching algorithm can operate like a spell checker and spelling-error corrector. For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.

With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application. Approximate matching is also used in spam filtering. Record linkage is a common application where records from two disparate databases are matched.

String matching cannot be used for most binary data, such as images and music. They require different algorithms, such as acoustic fingerprinting.,

Fuzzy Matching Techniques

Here’s a list of the various fuzzy matching techniques that are in use today:

• Levenshtein Distance (or Edit Distance)
• Damerau-Levenshtein Distance
• Jaro-Winkler Distance
• Keyboard Distance
• Kullback-Leibler Distance
• Jaccard Index
• Metaphone 3
• Name Variant
• Syllable Alignment
• Acronym

Since fuzzy matching is based on a probabilistic approach to identifying matches, it can offer a wide range of benefits such as:

·        Higher matching accuracy

·        Provides solutions to complex data

·        Easily configurable to effect false positives

·        Better suited to finding matches without a consistent unique identifier

·        Can incorrectly link different entities

·        Difficult to scale across larger datasets

·        Can require considerable testing for validation

• Online visitors can find products or product categories without knowing the correct spellings.
• Tourists can locate cities without knowing exact spellings or by using an approximate spelling in cases where the city name is listed in a different language.
• Scientists can find articles without having to know the exact titles.
• Users can find the names of films or books without knowing the exact titles.
• Users can search for people without knowing the exact spelling of their names.

Fuzzy matching also has the following drawbacks:

• It sometimes returns too many results, requiring the user to sift through all the suggestions to find the one that's most appropriate. Sometimes, the most appropriate result is toward the end of the list, several pages down; and sometimes, none of the results are appropriate.
• It only compensates for typos or misspellings and must be complemented with a semantic algorithm to also consider synonyms and semantics.

Fuzzywuzzy in Python

Fuzzywuzzy is a python library that uses Levenshtein Distance to calculate the differences between sequences and patterns, a service that finds event tickets from all over the internet and showcases them on one platform. The big problem they were facing was the labeling of the same events as stated on their blog

Python libraries such as FuzzyWuzzy can be used to run string matching in an easy and intuitive method. Using the Python Record Linkage Tookit, users can run several indexing methods including sorted neighborhood and blocking and identify duplicates using FuzzyWuzzy. Although Python is easy to use, it can be slower to run matches than other methods.

from fuzzywuzzy import fuzz

str1 = 'California, USA'

str2 = 'California'

ratio = fuzz.ratio(str1, str2)

partial_ratio = fuzz.partial_ratio(str1, str2)

print(ratio)

print(partial_ratio)

The `partial ratio` is `100` while the plain `ratio` is `80` so relying on `partial ratio` in handy in some scenarios.

## Levenshtein distance

The Levenshtein distance between two strings is the number of deletions, insertions and substitutions needed to transform one string into another.

Once you install the `python-Levenshtein` package

pip install python-Levenshtein

import Levenshtein as levenshtein

str1 = 'Context is worth 80 IQ points.'

str2 = 'A change in perspective is worth 80 IQ points'

edit_distance = levenshtein.distance(str1, str2)

similarity_ratio = levenshtein.ratio(str1, str2)

 Algorithm name Comparison Order Preprocess time complexity Search time complexity Main  characteristic Boyer Moore Right to Left O(m+n) O(mn) Use both good suffix shift and bad character shift BM Horspool Not relevant O(m+n) O(mn) Use only bad character shift with rightmost character Brute Force Not relevant No preprocessing O(mn) Use one by one character shift. Not an optimal one KMP Left to Right O(m) O(m+n) Independent of alphabet size, use the notion of border of the string, increases performance, decrease delay and decrease time of comparing Quick Search Not relevant O(m+n) O(mn) Use only bad character shift, very fast Rabin Karp Left to Right O(m) O(mn) Use hashing function, very effective for multiple pattern matching in 1D matching Approximate string matching Not relevant - O(mn) First matching approximate then searching dictionary Needleman Wunsch Left to Right - O(mn) Global alignment of proteins and nucleotides Smith Waterman Left to Right - O(mn) Align local sequence in segments of proteins and nucleotides

Applications of String Matching Algorithms:

Bioinformatics and DNA Sequencing

DNA and protein sequences have a central role in modern biology. The rapidly growing databases of such sequences present a challenge for developing efficient string matching algorithms. The DNA sequences of related species and even individuals within a species can differ slightly, and thus there is a need for approximate searching of the sequences in addition to exact searching.

In many cases, approximate matching is not sufficient to model the complex biological variation present in real sequences. A weighted pattern is one model that has been successfully applied to model. In bioinformatics, the terms position weight matrices, position-specific scoring matrices, or profiles are often used to refer to weighted patterns. String matching algorithms and DNA analysis are both collectively used for finding the occurrence of the pattern set.

Data Scanning

In anti-virus scanning, signatures are defined to describe known computer viruses, and the first task in these applications is to locate these signatures in large amounts of data. When a signature is found, more sophisticated methods are needed to confirm the presence of a computer virus. The rapidly growing set of signatures calls for efficient multiple string matching algorithms. In intrusion detection applications, strings related to attacks are defined. These strings are then searched for in network traffic, and the system is alerted for further inspection if a suspicious sequence of these strings is found.

Plagiarism Detection

Plagiarism has become a growing concern in education. In computer science, a particular problem is the copying of code for programming assignments.Thus, these algorithms are used to detect similarities between them and declare if the work is plagiarized or original.

Image Searching

Searching for images is an extension of string matching to two-dimensional objects. Several algorithms have been presented to solve the exact matching problem. We consider the two-dimensional version of parameterized string matching, which can identify an image even if its color map has been changed.

Digital Forensics

String matching algorithms are used to locate specific text strings of interest in the digital forensic text, which are useful for the investigation.

Spelling Checker

Trie is built based on a predefined set of patterns. Then, this trie is used for string matching. The text is taken as input, and if any such pattern occurs, it is shown by reaching the acceptance state.

Spam filters

Spam filters use string matching to discard the spam. For example, to categorize an email as spam or not, suspected spam keywords are searched in the content of the email by string matching algorithms.

Text Editor, Digital Library and Search Engine

Every person uses a text editor and every user of a digital library or search engine needs to find patterns in a text. The Boyer Moore algorithm is directly implemented the search command of practically all text editors. The longest common subsequence dynamic programming algorithm is implemented in system commands that test differences between files.

Multimedia and Computational Biology

It has shown that a much more generalized theoretical basis of pattern matching could be of tremendous benefit. Pattern matching has to adapt itself to increasingly broader definitions of matching. In computational biology one may be interested in finding a close mutation, in communications one may want to adjust for transmission noise, in texts it may be desirable to allow common typing errors. In multimedia, one may want to adjust for loss compressions, occlusions, scaling, affine transformations or dimension loss. The largest overlap heuristic for finding the shortest common superstring has been used in DNA sequencing. The algorithms that are used in Needleman Wunsch and Smith Waterman.

Medical Tests

The BMH algorithm achieves the best overall results when used with medical tests. This algorithm usually performs at least twice as fast as the other algorithms tested. The time performance of exact string pattern matching can be greatly improved if an efficient algorithm is used. Considering the growing amount of text handled in the electronic patient record, it is worth implementing this efficient algorithm.

String Prefix Matching Problem

This refers to the matching the prefixes of the pattern and the text. It also checks the longest prefix of some given sequence/text. This occurs at the starting of the patterns. It also includes preprocessing of the pattern. KMP algorithm and deterministic sequential comparison model are used to solve this problem. This can be done by assigning the lower and the upper bounds of the prefix to be matched.

Polymorphic String Matching

In this some basic string matching algorithm are combined to make one or more efficient algorithms for further computation as in fusion of the algorithms some functions become easy to define and some of the time consuming parameters need not be used. In this data representation technique used is tree. One example is KMP and Boyer Moore fusion. There combination completes the task in amortized constant time and cost is also equal to equality check cost. The quadratic time is dropped. The features of both the algorithms are combined and used for producing a better functional algorithm.

Network Intrusion Detection System

This problem requires an exact pattern matching technique. This is open source IDS snort. It reduces computational time and higher order of context matching is performed. To solve this Boyer Moore algorithm is used as it needs exact matching of the given pattern. To achieve this tree data structure is used in modified algorithms that convert the bad character shift to good prefix shift and resulting in far better performance.

Denoising of the Pattern

When the text is long it creates noise in the pattern. To remove these denoise technique i.e. DUDE algorithm is used for this. It comprises of simple string matching technique with radix sort for getting a noise free sequence. It has 2 passes: 1st collects the empirical probability distribution and observe the noise pattern within the double sided window and in 2nd pass those noise patterns are replaced by denoise pattern/sequence.

Approximate Similarity Measure

This algorithm is used in real-world applications. This is for finite-length strings. It searches the optimal alignment for pattern searching. It is far-far better than any other algorithms. In the previous similarity measure algorithm the variations in the pattern decreases but now in the new proposed algorithm the numbers of the comparisons are reduced. This is done by picking only those strings from the database whose lengths are either equal to or greater than the pattern to be matched.

Retrieving Music Pattern from Musical Database

When musical note from musical database are to be retrieved then we need string matching. There are four similar techniques for this: edit distance, dice similarity, jaccard similarity and cosine similarity. The musical notes are retrieved by QBE (query by example) approach. So the best scheme for this problem is Levenshtein distance with jaccard similarity. This is an approximate music search technique. As the jaccard similarity performs excellent in passing a query when a pitch change scenario is selected.

String Matching through Two Dimensional Meshes

This is the optimal time algorithm that is used for computational biology, cellular automata. It reduces time spent on the comparisons in string matching by using mesh or 2D matrix. Data partition technique is used here. The process is as find the occurrences of pattern, pattern preprocessing, text searching, decomposing the pattern given and finally creating sparse table. The maximum time will be less than half of the size of the string to be matched. This is a optimal algorithm for the mesh network structure.

See all(3)

Tags

Topics