Dotted Suffix Trees: A Structure for Approximate Text Indexing
String Processing and Information Retrieval (2006)
DOI: 10.1007/11880561_27
Luís Pedro Coelho and Arlindo L. Oliveira
INESC-ID/IST
{luis, aml}@algos.inesc-id.pt
Abstract. In this work, we address text indexing for approximate
matching. Given a text T, which is preprocessed to generate an index,
we can later query this index to identify the places where a string
occurs with up to a certain number of errors k (edit distance). The
indexing structure occupies O(n log^k n) space in the average case,
independent of alphabet size. This structure can be used to report
the existence of a match with k errors in O(3^k m^{k+1}) time and to
report the occurrences in O(3^k m^{k+1} + ed) time, where m is the
length of the pattern and ed is the number of matching edit scripts.
The construction of the structure takes time bounded by O(kN|Σ|),
where N is the number of nodes in the index and |Σ| is the alphabet size.
Keywords: string algorithms, suffix trees, approximate text matching,
text indexing.
1 Introduction
Since their introduction [1], suffix trees have been one of the methods of choice
for text indexing. However, in many real-life problems one is interested in finding
places in the text where an approximate form of the pattern occurs. In 2001,
Navarro et al. presented a survey of existing approaches to solving this
problem [2]. More recently, Maaß [3] presented both a survey of other work and his
own solution, which occupies, on average, O(|Σ|^k n log^k n) space (see footnote 1)
for a search time of O(m). In this work we present an approach based on an
extension of suffix trees. The main advantage of this approach is that both the
search time and the index size are alphabet independent (although the indexing
time is not).
The structure presented here is superficially very similar to the one presented
by Chattaraj [4] as an inexact suffix tree, but that work has different objectives.
Cole et al. [5] present a structure whose initial intuition resembles ours in that
it involves error trees. However, they make different time and space tradeoffs
to achieve O(m + log log n + occ) searching (Hamming distance) with an index of
size O(n log^k n / k!).
Supported by the Portuguese Science and Technology Foundation, project
posi/sri/47778/2002 BioGrid.
1 Maaß considers |Σ| and k as constants and presents O(n log^k n) as the complexity
result. However, this analysis ignores the potentially large impact of alphabet size.
F. Crestani, P. Ferragina, and M. Sanderson (Eds.): SPIRE 2006, LNCS 4209, pp. 329–336, 2006.
© Springer-Verlag Berlin Heidelberg 2006
2 The Indexing Structure
Definition 1 (Character, string). Given a set Σ, we say that S is a string
over Σ if S is a (possibly empty) sequence of elements of Σ. Elements of Σ will
be called characters. The length of the string S will be denoted by |S|. We shall
write S_i for the i-th element of S.
The set of all strings is denoted by Σ* and Σ+ = Σ* − {empty string}.
For denoting characters we shall use letters from the beginning of the Roman
alphabet (a, b, c,. . . ) and, for strings, we shall use letters from the end of the
alphabet (w, x, . . . ). In what follows we assume that there are two special
symbols ($ and .) which are not part of Σ.
Definition 2 (Concatenation, Prefix and Suffix). wx or aw will denote
the usual concatenation operation. If S = wxy, then w is a prefix of S, x is a
substring of S and y is a suffix of S (at position |wx|).
Definition 3 (Patricia tree, Suffix Tree, Suffix Link). T is a Patricia tree
if T is a rooted tree with edge labels from Σ+. For each a ∈ Σ and every node
n in T , there exists at most one edge leaving n whose label starts with a. Each
node in a Patricia tree has a path leading to it which forms a string. If the node
n has the leading path w, we shall also refer to n as w. A compact Patricia tree
omits nodes with just one child.
A suffix tree for a string S is a compact Patricia tree whose leaf nodes (those
without children) have paths corresponding to all suffixes of the string S$. A
suffix link in a suffix tree is a link from the node aw to the node w. This link has
the label a.
In a suffix tree, all internal nodes have a well defined suffix link. McCreight’s [6]
algorithm constructs a suffix tree with suffix links in linear time.
Definition 4 (Occurrence Set, Position Set). Given a node w in a suffix
tree, we call its occurrence set the set of indexes in the original string where the
string w occurs.
Given a node w in a suffix tree, its position set is the set formed by taking its
occurrence set and adding the length of w to each element.
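Definition 4 can be illustrated with a minimal Python sketch. It scans the text directly rather than using a suffix tree, and the helper names occurrence_set and position_set are hypothetical, chosen only for this illustration. Indexes are 1-based, as in the rest of the paper.

```python
def occurrence_set(text, w):
    """1-based start indexes of every occurrence of w in text (overlaps included)."""
    return {i + 1 for i in range(len(text) - len(w) + 1) if text.startswith(w, i)}


def position_set(text, w):
    """Occurrence set of w with len(w) added to each element (Definition 4)."""
    return {p + len(w) for p in occurrence_set(text, w)}


if __name__ == "__main__":
    print(occurrence_set("mississippi", "issi"))
    print(position_set("mississippi", "issi"))
```

For the string mississippi and the node issi, this gives the occurrence set {2, 5} and the position set {6, 9}.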
Lemma 1 (Position set at the suffix node). Given two nodes aw and w, if
one takes the position set of w and subtracts one from each element, one obtains a
superset of the position set of aw. The items shared by both sets are those positions
of the string which contain an a.
The lemma is fairly obvious given that the position set of aw contains all the
positions where aw occurs which are exactly those positions where w occurs
preceded by a.
Definition 5 (Approximate Match). We say that the string s matches the
string t at position p with k errors if we can make k modifications in s to obtain s′
which is a substring of t at position p. A modification is either deletion, insertion
or substitution of one character.
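As an illustration of Definition 5, the following Python sketch checks a match by brute force with the classic dynamic-programming edit distance. It is not the indexed search described later. The names edit_distance and matches_at are hypothetical, and the sketch assumes that "k errors" means at most k modifications and that positions are 1-based.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # match or substitute
        prev = cur
    return prev[-1]


def matches_at(s, t, p, k):
    """True if s matches t at 1-based position p with at most k errors."""
    start = p - 1
    shortest = max(0, len(s) - k)
    longest = min(len(t) - start, len(s) + k)
    # Try every substring of t starting at p whose length is within k of len(s).
    return any(edit_distance(s, t[start:start + n]) <= k
               for n in range(shortest, longest + 1))


if __name__ == "__main__":
    print(matches_at("isxi", "mississippi", 2, 1))
```

Here isxi matches mississippi at position 2 with one error, since substituting x by s yields issi.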
Definition 6 (Error Tree). For any node w, its error tree is formed by taking
its position set, adding one to each element and forming the Patricia tree of the
suffixes starting at those positions. If the position set includes the end of the
string, that element is removed.
The leaves are labeled by the position of the string in which their paths occur
minus (|w| + 1).
Definition 7 (1-error dotted Tree). A 1-error dotted tree is the tree which
is formed by adding to each node in a suffix tree, a new edge labeled by · which
points to its error tree. The edge labeled · shall be called a dot link.
[Figure: tree diagram with edge labels and leaf positions]
Fig. 1. 1-error dotted tree for mississippi
The 1-error dotted tree for mississippi is shown in Figure 1. The nodes are
connected to their error trees by thick diagonal links. We can see some examples
of the concepts above: for the node issi, the occurrence set is {2, 5} and its
position set is {6, 9}. In a sense, one can say that being at node issi is being at
positions 6 and 9 simultaneously. The error tree (at issi) is formed by placing
the suffixes starting at positions {7, 10} (i.e., sippi$ and pi$) in a Patricia tree. In
a leaf, the occurrence set is a singleton, and we label the leaf by its element.
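The error tree example can be reproduced with a short Python sketch. It lists the suffixes that would be placed in the error tree together with their leaf labels, without building the Patricia tree itself. The function name error_tree_suffixes is hypothetical, and the sketch assumes that an element is dropped when no suffix of S$ starts one position past it.

```python
def error_tree_suffixes(text, w):
    """Map leaf label -> suffix for the error tree of node w (Definition 6).

    Positions are 1-based and the terminator $ is appended to the text.
    """
    s = text + "$"
    occurrences = [i + 1 for i in range(len(s) - len(w) + 1) if s.startswith(w, i)]
    leaves = {}
    for occ in occurrences:
        start = occ + len(w) + 1  # position set element, plus one
        if start > len(s):        # assumed reading of "the end of the string"
            continue
        leaves[start - (len(w) + 1)] = s[start - 1:]
    return leaves


if __name__ == "__main__":
    print(error_tree_suffixes("mississippi", "issi"))
```

For the node issi this yields the suffixes sippi$ and pi$, labeled 2 and 5, which are the elements of the occurrence set of issi.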
The paths in the dotted tree are paths in the extended alphabet Σ ∪ {., $}.
The notions of occurrence set, position set and error tree are valid for all nodes
in a dotted tree.
Definition 8 (k-error dotted tree). We define a k-error dotted tree as the
tree obtained by adding error trees to each node in the (k − 1)-error dotted tree
which does not already contain one.
3 Searching
Given a pattern to search for, we follow it character by character, descending the
tree. We represent this walk by keeping a node and an offset from the start of its
incoming link. Inside an edge, we consider that there is an implicit dot link which
goes forward one character. At each point, we can take four possible actions: (1)
match, where we descend according to the pattern (may not be possible); (2)
substitution, where we follow the dot link (possibly implicit), moving in the
pattern; (3) insertion, where we follow the (possibly implicit) dot link, not
moving in the pattern; (4) deletion, where we advance in the pattern, while
not moving in the tree. We limit ourselves to at most k non-matching operations
(edits). Algorithm 1 implements the process just described.
Algorithm 1. Function findString(w, offset, s, k)
Input: Current node w
Input: Current offset offset
Input: String s
Input: Maximum errors k
Data: The tree's string treeString
  if k < 0 then return string not found
  if s is empty then report all w's children
  findString(w, offset, s + 1, k − 1)  // deletion
  if offset = length(w) then
    findString(w.dotLink, 0, s, k − 1)  // insertion
    findString(w.dotLink, 0, s + 1, k − 1)  // substitution
    child ← w.getSon(s_0)  // try matching
    if child isn't null then findString(child, 0, s + 1, k)
  else
    findString(w, offset + 1, s, k − 1)  // insertion
    if s_0 ≠ treeString[start(w) + offset] then k ← k − 1
    findString(w, offset + 1, s + 1, k)  // either match or substitution
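The four actions can be illustrated with a small Python sketch. This is not the dotted-tree search: it walks an uncompressed suffix trie stored as nested dictionaries and replaces each dot link by a loop over all children, so unlike the structure described here it is not alphabet independent. The names build_suffix_trie and exists_with_errors are hypothetical, and only the existence of a match is reported.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie of text + '$' as nested dicts (quadratic size)."""
    s = text + "$"
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def exists_with_errors(node, pattern, k):
    """True if pattern can be consumed from node using at most k edits."""
    if k < 0:
        return False
    if not pattern:
        return True
    # Deletion: advance in the pattern without moving in the trie.
    if exists_with_errors(node, pattern[1:], k - 1):
        return True
    for ch, child in node.items():
        if ch == "$":
            continue  # the terminator is never matched against the pattern
        # Match costs nothing; substitution costs one edit.
        cost = 0 if ch == pattern[0] else 1
        if exists_with_errors(child, pattern[1:], k - cost):
            return True
        # Insertion: move in the trie without advancing in the pattern.
        if exists_with_errors(child, pattern, k - 1):
            return True
    return False


if __name__ == "__main__":
    trie = build_suffix_trie("mississippi")
    print(exists_with_errors(trie, "isxi", 1))
```

In the dotted tree, the loop over children for substitution and insertion is replaced by following a single dot link, which is what removes the dependence on the alphabet size during search.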
There are at most \sum_{i=1}^{k} \binom{m}{i} = O(m^k) ways to combine k edit
operations into a string of size m. Since there are 3 operations (substitution,
insertion, and deletion), we have at most O(3^k m^k) sequences. Each sequence has
at most m + k = O(m) elements and therefore the total time to find matches is
O(3^k m^{k+1}).
Once a match has been found in the tree, reporting the leaves below the node
takes time proportional to the number of leaves, i.e., to the number of edit scripts
which can be used to match the pattern to a substring of the text (which can be
greater than the number of occurrences; see footnote 2). The total search time is
O(3^k m^{k+1} + ed).
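To give a sense of scale, the counting argument can be evaluated numerically. The Python sketch below uses hypothetical helper names and, as an example, the pattern length m = 15 and k = 2 errors used in the experiments later in the paper. The second function evaluates the expression inside the O(·) bound, not a measured running time.

```python
from math import comb


def edit_placements(m, k):
    """Sum of binomial(m, i) for i = 1..k, the count used in the analysis."""
    return sum(comb(m, i) for i in range(1, k + 1))


def search_bound(m, k):
    """The expression 3^k * m^(k+1) from the search-time bound."""
    return 3 ** k * m ** (k + 1)


if __name__ == "__main__":
    print(edit_placements(15, 2))
    print(search_bound(15, 2))
```

For m = 15 and k = 2 there are 15 + 105 = 120 placements, and the bound evaluates to 9 · 3375 = 30375, independent of the text length n.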
2 As often happens, strings of the form a^m serve as examples of pathological behaviour
as they can match any position of a string of the form a^n in a large number of ways.
4 Constructing the Dotted Tree
We start with a suffix tree and first show how to construct a one-error dotted
tree. We construct the error tree for the root, which is almost a copy of the entire
tree, except for two properties: (1) it does not have the leaf labeled 1 in the
original tree; and (2) for any other leaf w$ occurring at position p in the string,
we have a new leaf .w$ which occurs at position p − 1 in the string. For any other
node aw, the error tree is a copy of the error tree at node w (the node pointed
to by node aw's suffix link) with the following changes: (1) the leaf labeled 1 in
the original error tree is not included; (2) leaves in the copy have a label which
is the original value minus one; (3) a leaf labeled p is included only if s_{p−1} = a.
Algorithm 2. Copying a subtree
Input: A node in a suffix tree w
Input: An optional character a (not given when copying the root)
Data: The original string string
1   copy ← make-copy(w)
2   if w is a leaf then
3     p ← w.label
4     if p = 1 then return null
5     if a was given and string_{p−1} ≠ a then return null
6     copy.label ← copy.label − 1; return copy
7   foreach n ∈ w.sons do
8     copy_of_son ← copySubtree(n, a)
9     if copy_of_son isn't null then
10      copy.sons ← copy.sons ∪ {copy_of_son}
11  if copy.sons is empty then return null
12  if copy.sons has only one element then
13    merge copy.sons into copy and return that
14  return copy
These conditions are an expression of Lemma 1 and an extension of the
conditions for the root. Both are implemented by Algorithm 2. The only point to note
is line 12. Since we filter some leaves, we can create nodes with only one child.
These are removed by merging a child with its (single) parent. Since a typical
suffix tree implementation just stores, at each node, indices to the start and end
of the substring labeling its incoming edge, merging is achieved by adjusting the
start index. The construction of the tree using either Ukkonen's or McCreight's
algorithm ensures that this operation is correct.
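The filtering copy can be sketched in Python under some simplifying assumptions. The Node class, the naive quadratic builder build_sons and the helper leaf_labels are hypothetical names introduced for this illustration. Edge labels are stored as explicit strings, so merging concatenates them instead of adjusting a start index, and the copy is applied directly to the root of a suffix tree to show the effect of the filter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    edge: str = ""               # label of the incoming edge
    label: Optional[int] = None  # 1-based suffix start, set on leaves only
    sons: list = field(default_factory=list)


def build_sons(s, starts, depth):
    """Naive compact suffix tree: children for suffixes sharing `depth` characters."""
    groups = {}
    for p in starts:
        groups.setdefault(s[p - 1 + depth], []).append(p)
    sons = []
    for group in groups.values():
        first = group[0]
        if len(group) == 1:
            sons.append(Node(edge=s[first - 1 + depth:], label=first))
            continue
        d = depth + 1
        while len({s[p - 1 + d] for p in group}) == 1:
            d += 1  # extend the edge while all suffixes in the group agree
        sons.append(Node(edge=s[first - 1 + depth:first - 1 + d],
                         sons=build_sons(s, group, d)))
    return sons


def copy_subtree(w, s, a=None):
    """Filtered copy in the spirit of Algorithm 2; returns None for an empty copy."""
    if not w.sons:  # leaf
        p = w.label
        if p == 1 or (a is not None and s[p - 2] != a):
            return None
        return Node(edge=w.edge, label=p - 1)
    kept = [c for c in (copy_subtree(n, s, a) for n in w.sons) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:  # merge the single child into its parent
        kept[0].edge = w.edge + kept[0].edge
        return kept[0]
    return Node(edge=w.edge, sons=kept)


def leaf_labels(node):
    if not node.sons:
        return [node.label]
    return sorted(x for son in node.sons for x in leaf_labels(son))


if __name__ == "__main__":
    s = "mississippi$"
    root = Node(sons=build_sons(s, list(range(1, len(s) + 1)), 0))
    print(leaf_labels(copy_subtree(root, s)))
    print(leaf_labels(copy_subtree(root, s, "s")))
```

Copying the root without a character keeps every leaf except the one labeled 1 and shifts all labels down by one. Copying with a = s keeps only the leaves whose suffix is preceded by s, and the merge step leaves no node with a single child.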
Copying a tree takes time proportional to the number of nodes it contains.
The error tree at the root is a straightforward copy of the whole tree. Every other
error tree is a copy of an existing one. Since each node can have at most |Σ|
incoming suffix links, each error tree is traversed at most |Σ| times. The sum
of all these operations is therefore bounded by |Σ|N. Therefore, if the number
of nodes in the final tree is N, construction is done in time O(N|Σ|).
The above algorithm can be used to construct trees with any number of errors
by iterating it. To construct the (k + 1)-error tree from the k-error tree, make
an adjusted copy of the tree as above (adjusting leaves and filtering the leaves
with label 1) and make this the new root error tree. Then, for every other node,
remove the current error tree. Finally, for every node except the root, construct
its error tree as above.
Let N_k be the number of nodes of the k-error dotted tree. We will use N for
N_k if k is known from context. The analysis above remains valid and we now
have that the time cost is O(N_1|Σ| + ... + N_k|Σ|) = O(kN|Σ|).
5 Space Considerations
Let l be the maximum string depth of any node in the tree (see footnote 3). We
show N_k = O(n l^k) by induction. It is known that N_0 = O(n). The algorithm for
turning a k-error into a (k + 1)-error dotted tree can be viewed in the following
way (see footnote 4).
First it constructs the error tree at the root and clears all the other error trees.
Then it proceeds in stages, making a (possibly incomplete) copy of this tree
spread amongst the nodes at string depth 1. It processes the other nodes in
order of increasing string depth. At each string depth, the number of nodes is
increased by a maximum of N_k. Therefore, we start with N_k nodes, make an almost
full copy, and copy that at most l times, so N_{k+1} = O(N_k (l + 1)). Assuming
N_k = O(n l^k) by induction, we conclude N_{k+1} = O(n l^{k+1}).
So far, we have achieved little since in the worst case l = n − 1 (consider
aaaa ...). However, under very general assumptions (which natural language
texts and DNA experimentally verify), the expected case is l = O(log n) [7] and
we have N_k = O(n log^k n).
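The induction can also be written by unrolling the recurrence. The block below is a sketch that assumes an explicit constant c hidden in the bound N_{k+1} = O(N_k (l + 1)) and treats k as fixed; the constant c is introduced here only for the illustration.

```latex
\begin{align*}
N_{k+1} &\le c\,(l+1)\,N_k \\
N_k &\le c^{k}\,(l+1)^{k}\,N_0 \le (2c)^{k}\,l^{k}\,N_0 = O\!\left(n\,l^{k}\right)
      \quad \text{for fixed } k \text{ and } l \ge 1 \\
l = O(\log n) &\;\Longrightarrow\; N_k = O\!\left(n \log^{k} n\right)
\end{align*}
```

The second line uses l + 1 ≤ 2l for l ≥ 1 and N_0 = O(n); the third line substitutes the expected maximum string depth.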
6 Experimental Results
Three data sets were used: English text, the DNA of yeast, and randomly
generated text. Results on all sets are qualitatively similar.
To experimentally verify the average case prediction, we show in Figure 2
the ratios between the k-error and the (k + 1)-error dotted trees regarding the
number of nodes in the trees. We can easily see that the experimental values do
resemble a logarithm as predicted.
Searches were then performed on top of previously indexed text. We only
report whether the string exists in the text (and not all occurrences). Therefore,
the number of occurrences has no influence on the search time. Figure 3 shows
3 For a node w, its string depth is |w|.
4 Having the nodes processed in this order is, in fact, difficult to code for. However, as
an analysis tool, it is a valid assumption.
[Figure: plot of Ratio (0 to 10) against Number of Characters (0k to 250k); series: No errors to one error, 1 error to 2 errors]
Fig. 2. Size ratio on English text
[Figure: plot of Steps (average) (0k to 20k) against Text Size (0k to 200k); series: Existing string and Non-existing string, each for dna, english, and random text]
Fig. 3. Searching with 2 errors
the results of searching for 15-character patterns with 2 errors, while varying
the text size. After a small initial growth, explainable by the increasing density
of the tree, the search time is roughly constant.
7 Conclusions
We presented an indexing structure for approximate text matching which takes,
on average, O(n log^k n) space. This complexity was predicted theoretically and
observed experimentally. This structure reports the existence of a match in
O(3^k m^{k+1}) time and reports the positions where the matches occur in
O(3^k m^{k+1} + ed) time. It can be constructed in O(kN|Σ|) time, N being the
actual number of nodes. The structure and the algorithms to construct it are
simple and easy to implement. The fact that the structure uses O(ed) time
(instead of O(occ)) to report the occurrences of a pattern may be a disadvantage
in some applications. In other applications (e.g., searching in DNA strings for
degenerate occurrences of long strings), this will not be a problem since each
occurrence will, in general, correspond to only one edit script.
The amount of space the index takes might limit its applicability. One
direction for tackling this problem is the following remark: in the example for the
string mississippi, presented in Figure 1, one can see that the trees below s.i and
ssi are exactly the same. Whether such occurrences are the basis for a significant
space saving, and how to exploit them, is an open question. Going further, the
definition of error trees might be extended to structures such as the suffix-dag
presented by Gusfield [8, § 7.7].
Another limitation that should be addressed in future work is related to
the fact that the complexity of reporting occurrences depends on the number
of edit scripts, and not on the number of occurrences.
Acknowledgments. We thank L. Russo and S. Madeira for several productive
discussions.
0
1
2
3
4
5
6
7
8
9
10
0k 50k 100k 150k 200k 250k
R
a
ti
o
Number of Characters
No errors to one error 1 error to 2 errors
Fig. 2. Size ratio on English text
Fig. 3. Searching with 2 errors (x-axis: text size, 0k to 200k; y-axis: steps (average), 0k to 20k; series: existing and non-existing strings on DNA, English, and random text) [plot omitted]
References
1. Weiner, P.: Linear pattern matching algorithms. In: FOCS, IEEE (1973) 1–11
2. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33 (2001)
3. Maaß, M.G., Nowak, J.: Text indexing with errors. In: Proc. 16th Annual Symp. on Combinatorial Pattern Matching (CPM). Volume 3537 of LNCS, Springer (2005) 21–32
4. Chattaraj, A., Parida, L.: An inexact-suffix-tree-based algorithm for detecting extensible patterns. Theor. Comput. Sci. 335 (2005) 3–14
5. Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC (2004) 91–100
6. McCreight, E.: A space-economical suffix tree construction algorithm. J. ACM 23 (1976) 262–272
7. Apostolico, A., Szpankowski, W.: Self-alignments in words and their applications. J. Algorithms 13 (1992) 446–467
8. Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA (1997)
References
1. Weiner, P.: Linear pattern matching algorithms. In: FOCS, IEEE (1973) 1–11
2. Navarro, G.: A guided tour to approximate string matching. ACM Computing
Surveys 33 (2001)
3. Maaß, M.G., Nowak, J.: Text indexing with erros. In: Proc. 16th Annual Symp. on
Combinatorial Pattern Matching (CPM). Volume 3537 of LNCS., Springer (2005)
21–32
4. Chattaraj, A., Parida, L.: An inexact-suffix-tree-based algorithm for detecting ex-
tensible patterns. Theor. Comput. Sci. 335 (2005) 3–14
5. Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with
errors and don’t cares. In: STOC. (2004) 91–100
6. McCreight, E.: A space-economical suffix tree construction algorithm. J. ACM 23
(1976) 262–272
7. Apostolico, A., Szpankowski, W.: Self-alignments in words and their applications.
J. Algorithms 13 (1992) 446–467
8. Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and
computational biology. Cambridge University Press, New York, NY, USA (1997)
Sign up today - FREE
Mendeley saves you time finding and organizing research. Learn more
- All your research in one place
- Add and import papers easily
- Access it anywhere, anytime
Start using Mendeley in seconds!
Readership Statistics
5 Readers on Mendeley
by Discipline
20% Linguistics
by Academic Status
20% Student (Bachelor)
20% Student (Master)
20% Student (Postgraduate)
by Country
20% Italy
20% Denmark
20% United States


