- Source: Generalized suffix array
In computer science, a generalized suffix array (GSA) is a suffix array containing all suffixes for a set of strings. Given the set of strings
S
=
S
1
,
S
2
,
S
3
,
.
.
.
,
S
k
{\displaystyle S=S_{1},S_{2},S_{3},...,S_{k}}
of total length
n
{\displaystyle n}
, it is a lexicographically sorted array of all suffixes of each string in
S
{\displaystyle S}
. It is primarily used in bioinformatics and string processing.
Functionality
The functionality of a generalized suffix array is as follows:
For a collection or set of strings,
S
=
S
1
,
S
2
,
S
3
,
.
.
.
,
S
k
{\displaystyle S=S_{1},S_{2},S_{3},...,S_{k}}
.
It is a lexicographically sorted array of all suffixes of each string in the set
S
{\displaystyle S}
.
In the array, each suffix is represented by an integer pair
(
i
,
j
)
{\displaystyle (i,j)}
which denotes the suffix starting from position
j
{\displaystyle j}
in
s
i
{\displaystyle s_{i}}
.
In the case where different strings in
S
{\displaystyle S}
have identical suffixes, in the generalized suffix array, those suffixes will occupy consecutive positions. However, for convenience, the exception can be made where repeats will not be listed.
A generalized suffix array can be generated for a generalized suffix tree. When compared to a generalized suffix tree, while the generalized suffix array will require more time to construct, it will use less space than the tree.
Construction Algorithms and Implementations
Algorithms and tools for constructing a generalized suffix array include:
Fei Shi's (1996) algorithm which runs in
O
(
N
log
n
)
{\displaystyle {\mathcal {O}}(N\log n)}
worst case time and
O
(
N
)
{\displaystyle {\mathcal {O}}(N)}
space, where
N
{\displaystyle N}
is the sum of the lengths of all strings in
S
{\displaystyle S}
and
n
{\displaystyle n}
the length of the longest string in
S
{\displaystyle S}
. This includes sorting, searching and finding the longest common prefixes.
The external generalized enhanced suffix array, or eGSA, construction algorithm which specializes in external memory construction, is particularly useful when the size of the input collection or data structure is larger than the amount of available internal memory
gsufsort is an open-source, fast, portable and lightweight tool for the construction of generalized suffix arrays and related data structures like Burrows–Wheeler transform or LCP Array)
Mnemonist, a collection of data structures implemented in JavaScript contains an implementation for a generalized suffix tree and can be found publicly on npm and GitHub.
Solving the Pattern Matching Problem
Generalized suffix arrays can be used to solve the pattern matching problem:
Given a pattern
P
{\displaystyle P}
and a text
T
{\displaystyle T}
, find all occurrences of
P
{\displaystyle P}
in
T
{\displaystyle T}
.
Using the generalized suffix array
G
{\displaystyle G}
of
T
{\displaystyle T}
, then first, the suffixes that have
P
{\displaystyle P}
as a prefix need to be found.
Since
G
{\displaystyle G}
is a lexicographically sorted array of the suffixes of
T
{\displaystyle T}
, then all such suffixes will appear in consecutive positions within
G
{\displaystyle G}
. Particularly important, since
G
{\displaystyle G}
is sorted, it makes identification of suffixes possible and easy using binary search.
Using binary search, first find the smallest index
i
{\displaystyle i}
in
G
{\displaystyle G}
such that
s
u
f
f
G
[
i
]
{\displaystyle suff_{G}[i]}
contains
P
{\displaystyle P}
as a prefix, or determine that no such suffix is present. In the case where the suffix is not found,
P
{\displaystyle P}
does not occur in
T
{\displaystyle T}
. Otherwise, find the largest index
j
(
≥
i
)
{\displaystyle j(\geq i)}
which contains
P
{\displaystyle P}
as a prefix. The elements in the range
G
[
i
.
.
j
]
{\displaystyle G[i..j]}
indicate the starting positions of the occurrences of
P
{\displaystyle P}
in
T
{\displaystyle T}
.
Binary search on
G
{\displaystyle G}
takes
Θ
(
l
o
g
n
)
{\displaystyle \Theta (logn)}
comparisons.
P
{\displaystyle P}
is compared with a suffix to determine their lexicographic order in each comparison that is done. Thus, this requires comparing at most
|
P
|
=
m
{\displaystyle |P|=m}
characters. Note that a
l
c
p
{\displaystyle lcp}
array is not required, but will offer the benefit of a lower running time.
The runtime of the algorithm is
Θ
(
m
l
o
g
n
)
{\displaystyle \Theta (mlogn)}
. By comparison, solving this problem using suffix trees takes
Θ
(
m
)
{\displaystyle \Theta (m)}
time. Note that with a generalized suffix array, the space required is smaller compared to a suffix tree, since the algorithm only requires space for
n
{\displaystyle n}
words and the space to store the string. As mentioned above, by optionally keeping track of
l
c
p
{\displaystyle lcp}
information which will use slightly more space, the running time of the algorithm can be improved to
Θ
(
m
+
l
o
g
n
)
{\displaystyle \Theta (m+logn)}
.
Other Applications
A generalized suffix array can be utilized to compute the longest common subsequence of all the strings in a set or collection. A naive implementation would compute the largest common subsequence of all the strings in the set in
Θ
(
n
2
)
{\displaystyle \Theta (n^{2})}
.
A generalized suffix array can be utilized to find the longest previous factor array, a concept central to text compression techniques and in the detection of motifs and repeats
See also
Suffix Tree
Suffix Array
Generalized Suffix Tree
Pattern matching problem
Bioinformatics
References
External links
Generalized enhanced suffix array construction in external memory
Kata Kunci Pencarian:
- Generalized suffix array
- Suffix array
- Generalized suffix tree
- Suffix tree
- Longest common substring
- Suffix automaton
- Rope (data structure)
- String-searching algorithm
- C++11
- Wavelet Tree