1 month, 2 weeks

Suffix Array




 

A suffix array is a sorted array of all suffixes of a given string. After sorting the suffixes in lexicographical order, we can get the suffix array. Suffix arrays can also be formed using suffix trees. A suffix array can be constructed from Suffix tree by doing a DFS traversal of the suffix tree. In fact Suffix array and suffix tree both can be constructed from each other in linear time. We can also find substrings using suffix array by using binary search type procedure. 

We can use Python bisect module for binary search.

bisect.bisect_left(a,    x): Leftmost offset where we can insert x into a to maintain sorted order. a is already sorted!
bisect.bisect_right(a,    x): Like bisect_left, but returning rightmost instead of leftmost offset.

Time Complexity: O(mlogn) where m is length of the pattern to be searched and n is length of the text. 
Auxiliary Space: O(m+n) 

Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (e.g., compared to Ukkonen’s algorithm) and improved cache locality. It is a data structure used in, among others, full text indices, data compression algorithms, and the field of bibliometrics.(Source:Wiki)

Naive method to build Suffix Array: O(n*k*Logn). If we consider a O(nLogn)) algorithm used for sorting. The sorting step itself takes O(n*k*Logn) time as every comparison is a comparison of two strings and the comparison takes O(K) time where K is max length of string in given array. Auxiliary Space: O(n)

 

Longest Common Prefix (LCP) Arrays                                                                                  

The longest common prefix (LCP) array is a commonly used data structure alongside the suffix array (SA). The LCP array stores the length of the longest common prefix between two adjacent suffixes of a given string as they are stored (in lexicographical order) in the SA. A typical use of the combination of the SA and the LCP array is to simulate the suffix tree functionality using less space. 

Four different approaches :

Horizontal Scanning — Find the LCP of strs[0] with strs[1] with strs[2] and so on. 

Vertical Scanning — Scan all the characters at index 0 then index 1 then index 2 and so on.

Divide and Conquer — Divide the strs array to two parts and merge the LCP of both the subparts.

Binary Search — Compare the substrings[0 to mid] of the smallest string with each string and keep on updating front and behind accordingly.

Examples:                                                                                                                
Input: {"programming", "progra", "progr", "prog"}                                                     
Output: "prog"                                                                                                           

 

Kasai’s Algorithm                                                                            

Kasai’s Algorithm is used to get the Longest Common Prefix (LCP) array from suffix array. At first suffix arrays are found. After that Kasai's algorithm takes the suffix array list to find LCP that takes O(m log n), where m is pattern length and n is the length of the text. The Kasai’s Algorithm, it takes  O(n) for searching the pattern in the main string.

Naive Construction of LCP Array
def lcp_naive ( pos , T ):
    lcp = [ -1] # first -1 (at index 0)
    for r in range (1 , len ( T )):
 # compare suffix starting at pos[r -1]
 # to suffix starting at pos[r]
    x = 0
   while T [ pos [r -1]+ x ] == T [ pos [ r ]+ x ]:
       x += 1 # cannot run off the string ( sentinel !)
      lcp.append ( x )
  lcp . append ( -1) # trailing -1 (at index n)
  return lcp

 

Shortest Unique Substring
def shortest_unique_substring ( pos , lcp ):
   n = len ( pos )
# full text ( without sentinel ) is always unique
   best_i = 0
   best_j = n -1
   for r in range ( len ( pos )):
      i = pos [ r ]
      j = i + max ( lcp [ r ] , lcp [ r +1]) + 1
      if j == n : continue
      if (j - i ) < ( best_j - best_i ):
          best_i , best_j = i , j
  return best_i , best_j

 

Code: Longest Common Substring
def longest_common_substring (s , t ):
    T = s + ’#’ + t + ’$’
    pos , lcp = sa_and_lcp ( T )
    lcs = ’’
    for r in range (1 , len ( pos )):
 # do both suffixes start in the same string = > skip r
        if (pos [ r] <= len(s) and pos[r -1] <= len ( s )) or (pos[ r ] > len (s) and pos [r -1] > len (s)):
            continue
        if lcp [ r ] > len ( lcs ):
            lcs = T [ pos [ r ]: pos [ r ]+ lcp [ r ]]
   return lcs

 

Linear Time LCP Construction: Kasai’s Algorithm

Input: Text T, suffix array pos, its inverse rank.

Idea:
-Compare each suffix, starting at text position p = 0, 1, . . . , n − 1, to its respective predecessor (= lexicographically next smaller suffix)
-Get predecessor by using suffix array (pos) and its inverse (rank): For the suffix starting at p, find text position pos[rank[p] − 1].
-Fill in LCP table in rank[p]-order (not from left to right or r-order!)
-Moving from p to p + 1, we keep the computed common prefix, without the first character, similarly to following a suffix link.
-This is what saves us time.

Code: Kasai’s Algorithm
 
def lcp ( pos , T ):
     n = len ( pos )
     lcp = [ -1] * ( n +1)
     rank = invert_sa ( pos )
     l = 0           # current common prefix length
     for p in range (n -1):
         r = rank [ p ]
 # within length of T and chars agree ?
         while (pos [r -1] + l < len ( T )) and ( p + l < len ( T )) and (T [ p + l ] == T [ pos [r -1] + l ]):
             l += 1
         lcp [ r ] = l
         l = max (l -1 , 0)    # next suffix : lose first character
     return lcp

Why Is That Linear Time?
1   for p in range (n -1):
2        r = rank [ p ]
3        while ( pos [r -1] + l < len ( T )) and ( p + l < len ( T )) and ( T [ p + l ] == T [ pos [r -1] + l ]):
4             l += 1
5        lcp [ r ] = l
6        l = max (l -1 , 0)                      # next suffix : lose first character

 

Test in Line 3 can be performed at most 2n times:
    Mismatch: while loop terminated: at most n − 1 times
    Match: l is incremented in Line 4 and can decrease by at most 1 in Line 6
    p increased in Line 1
         → p+l is larger when next reaching Line 3
         → can happen at most n times

The algorithm constructs LCP array from suffix array and input text in O(n) time. 


Responses(0)