We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Σ and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(D N^{3 pow(ε) − 1} log N), where ε = D/P and pow(ε) is an increasing, concave function that is 0 when ε = 0 and about 0.9 for DNA and protein sequences. © Oxford University Press 2001.
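To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the seed-based algorithm of the paper: it is a naive baseline that takes at least quadratic time in N and only illustrates what "repeats with at most D differences" means. The function names, the (i, j) output format, and the choice to require only that the two copies do not overlap are assumptions made for illustration.

```python
from typing import List, Tuple


def min_prefix_edit_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any prefix of `text`.

    Differences are insertions, deletions, and mismatches, each of cost 1.
    """
    # prev[k] = edit distance between the pattern prefix seen so far and text[:k]
    prev = list(range(len(text) + 1))
    for a, pc in enumerate(pattern, start=1):
        cur = [a]  # pattern[:a] against the empty prefix costs a deletions
        for k, tc in enumerate(text, start=1):
            cur.append(min(prev[k] + 1,              # delete pc
                           cur[k - 1] + 1,           # insert tc
                           prev[k - 1] + (pc != tc)))  # match or mismatch
        prev = cur
    return min(prev)


def naive_approximate_repeats(s: str, p: int, d: int) -> List[Tuple[int, int]]:
    """Return pairs (i, j) such that s[i:i+p] recurs starting at j with <= d differences.

    Assumption: the second copy starts at or after i + p, so copies never overlap.
    """
    n = len(s)
    hits = []
    for i in range(n - p + 1):
        pattern = s[i:i + p]
        # The second copy needs at least p - d characters left in s.
        for j in range(i + p, n - (p - d) + 1):
            window = s[j:j + p + d]  # a copy with insertions can span up to p + d characters
            if min_prefix_edit_distance(pattern, window) <= d:
                hits.append((i, j))
    return hits
```

For example, `naive_approximate_repeats("ACGTGGACGT", 4, 0)` reports the exact repeat of `ACGT` at positions 0 and 6, and raising `d` to 1 on `"ACGTTTACCT"` additionally accepts `ACCT` at position 6 as a one-mismatch copy of `ACGT`.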
CITATION STYLE
Adebiyi, E. F., Jiang, T., & Kaufmann, M. (2001). An efficient algorithm for finding short approximate non-tandem repeats. Bioinformatics, 17(SUPPL. 1). https://doi.org/10.1093/bioinformatics/17.suppl_1.S5