5 months, 2 weeks

# Rabin–Karp algorithm

Rabin–Karp algorithm or Karp–Rabin algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin in 1987 that uses hashing to find an exact match of a pattern string in a text. It also checks the pattern by moving window one by one, but without checking all characters for all cases, it finds the hash value. A hash function is a tool to map a larger input value to a smaller output value. This output value is called the hash value. When the hash value is matched, then only it tries to check each character.

• The Rabin-Karp algorithm calculates a hash value for the pattern, and for each M-character subsequence of text to be compared.
• If the hash values are unequal, the algorithm will calculate the hash value for next M-character sequence.
• If the hash values are equal, the algorithm will compare the pattern and the M-character sequence.
• In this way, there is only one comparison per text subsequence, and character matching is only needed when hash values match.

We'll determine hash value using this formula:
(1st letter) X (prime) + (2nd letter) X (prime)¹ + (3rd letter) X (prime)² X + ......

If M is large, then the resulting value (b^M) will be enormous. For this reason, we hash the value by taking it mod a prime number q.

The pseudo-code will be:

Hash Calculation:

Procedure Calculate-Hash(String, Prime, x):
hash := 0                                  // Here x denotes the length to be considered
for m from 1 to x                          // to find the hash value
hash := hash + (Value of String[m])ᵐ⁻¹
end for
Return hash

Hash Recalculation:

Procedure Recalculate-Hash(String, Curr, Prime, Hash):
Hash := Hash - Value of String[Curr]   //here Curr denotes First Letter of Previous String
Hash := Hash / Prime
m := String.length
New := Curr + m - 1
Hash := Hash + (Value of String[New])ᵐ⁻¹
Return Hash

String Match:

Procedure String-Match(Text, Pattern, m):
for i from m to Pattern-length + m - 1
if Text[i] is not equal to Pattern[i]
Return false
end if
end for
Return true

Rabin-Karp:

Procedure Rabin-Karp(Text, Pattern, Prime):
m := Pattern.Length
HashValue := Calculate-Hash(Pattern, Prime, m)
CurrValue := Calculate-Hash(Pattern, Prime, m)
for i from 1 to Text.length - m
if HashValue == CurrValue and String-Match(Text, Pattern, i) is true
Return i
end if
CurrValue := Recalculate-Hash(String, i+1, Prime, CurrValue)
end for
Return -1
If the algorithm doesn't find any match, it simply returns -1.

def rabin_karp(pattern, text):
d = 256      # d is the number of characters in the input alphabet
q = 467      # prime number
M = len(pattern)
N = len(text)
p = 0        # hash for pattern
t = 0        # hash for text
h = 1
i = 0
j = 0
res = []
for i in xrange(M - 1):
h = (h*d) % q
for i in xrange(M):
p = (d*p + ord(pattern[i])) % q
t = (d*t + ord(text[i])) % q
for i in xrange(N - M + 1):
if p == t:
for j in xrange(M):
if text[i + j] != pattern[j]:
break
j += 1
if j == M:
res.append(i)
if i < N - M:
t = (d*(t-ord(text[i])*h) + ord(text[i + M])) % q
if t < 0:
t = t + q
return res

range() – This returns a range object (a type of iterable).
xrange() – This function returns the generator object that can be used to display numbers only by looping. The only particular range is displayed on demand and hence called “lazy evaluation“. (Python 2) Execution speed is faster. Takes less memory as it keeps only one element at a time in memory.

When the hash value of the pattern matches with the hash value of a window of the text but the window is not the actual pattern then it is called a spurious hit. Spurious hit increases the time complexity of the algorithm. In order to minimize spurious hit, we use modulus. It greatly reduces the spurious hit.

Assume the text is length n and the length of the word is m. The best- and average-case running time of Rabin-Karp is O(m + n)  because the rolling hash step takes O(n) time and once the algorithm finds a potential match, it must verify each letter to make sure that the match is true and not a result of a hashing collision and therefore must check each of the mm letters in the word.

The worst-case running time of Rabin-Karp is O(nm). This would occur with an extremely awful hash function that resulted in a false positive at each step. Since whenever the algorithm thinks it found a match, it must verify each of the mm letters in the word, if there is a collision at each step, mm letters will be checked nn times resulting in a running time of O(nm)O. This can be avoided with a good choice of hash function.

Rabin-Karp algorithm is used in detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical here. Knuth-Morris-Pratt algorithm or Boyer-Moore String Search algorithm is faster single pattern string searching algorithm than Rabin-Karp algorithm. However, it is an algorithm of choice for multiple pattern search. If we want to find any of the large number, say k, fixed length patterns in a text, we can create a simple variant of the Rabin-Karp algorithm.

"String matching algorithms are used widely in computer science that is a very important issue in text processing. Also, string matching algorithms are used as basic components in the practical software in operating systems. A lot of research is performed in this field that hash based matching algorithms like Rabin-Karp is one of their fastest, but due to the rapid growth of data volumes, traditional algorithms are not sufficient. To solve this problem, combining Bloom filter features with string matching algorithms, makes it possible to reduce the execution time. In present paper, a modified version of the Rabin-Karp algorithm is presented by using a Bloom filter as preprocessing phase that can early detect pattern absence. In addition, this approach maximize speed up and reduce the number of patterns needed for the exact Rabin-Karp matching phase. "

Bloom Filter is a space-efficient probabilistic data structure which is used to search an element within a large set of elements in constant time that is O(K) where K is the number of hash functions being used in Bloom Filter. This is useful in cases where: the data to be searched is large. Cassandra uses bloom filters to optimize the search of data in an SSTable on the disk. CDNs use bloom filters to avoid caching items that are rarely searched.

Applications:
- For pattern matching
- For searching string in a bigger text
- Text processing
- Bioinformatics
- Compression

See all(3)

Tags

Topics
Related