Finding similar regions in many strings


Abstract

Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s1, ..., sn. The Consensus Patterns problem, which has been widely studied in bioinformatics research, asks in its simplest form for a region of length L in each si and a median string s of length L such that the total Hamming distance from s to these regions is minimized. We show that the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem that allows at most a constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem asks for the smallest d and a string s that is within Hamming distance d of each si. The problem is NP-hard. [3] gives a polynomial time algorithm for constant d. For super-logarithmic d, [2, 9] give efficient approximation algorithms using linear programming relaxation techniques. The best polynomial time approximation has ratio 4/3 for all d, given by [18] ([9] also independently claimed the 4/3 ratio, but only for super-logarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation, with ratio 2 - 2/(2|Σ|+1), for the more elusive Closest Substring problem: find a string s of length L such that, for each i, s is within Hamming distance d of some length-L substring of si.
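To make the three objectives concrete, the following minimal Python sketch evaluates each of them for a candidate string s and finds an optimum by exhaustive search. It is not the PTAS or the approximation algorithm from the paper: the search tries all |Σ|^L candidates and is only usable on toy inputs. All function names (`hamming`, `windows`, `consensus_cost`, `closest_string_radius`, `closest_substring_radius`, `brute_force`) and the default alphabet "ACGT" are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def windows(t, L):
    """All length-L substrings (regions) of t."""
    return [t[i:i + L] for i in range(len(t) - L + 1)]


def consensus_cost(s, seqs):
    """Consensus Patterns objective: total distance from s to the
    best length-len(s) region of each sequence."""
    return sum(min(hamming(s, w) for w in windows(t, len(s))) for t in seqs)


def closest_string_radius(s, seqs):
    """Closest String objective: largest distance from s to any sequence.
    Every sequence must have the same length as s."""
    return max(hamming(s, t) for t in seqs)


def closest_substring_radius(s, seqs):
    """Closest Substring objective: largest distance from s to the
    best length-len(s) substring of each sequence."""
    return max(min(hamming(s, w) for w in windows(t, len(s))) for t in seqs)


def brute_force(objective, seqs, L, alphabet="ACGT"):
    """Try every string of length L over the alphabet (exponential in L;
    illustration only) and return (best_string, objective_value)."""
    candidates = ("".join(c) for c in product(alphabet, repeat=L))
    best = min(candidates, key=lambda s: objective(s, seqs))
    return best, objective(best, seqs)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    print(brute_force(closest_string_radius, ["ACGT", "ACGA", "ACCT"], 4))
    print(brute_force(consensus_cost, ["TTACG", "ACGTT", "GACGA"], 3))
    print(brute_force(closest_substring_radius, ["TTACG", "ACGTT", "GACGA"], 3))
```

The only difference between the Consensus Patterns and Closest Substring objectives in this sketch is the outer aggregation: a sum of per-sequence distances in the first case and a maximum in the second.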

Cite

Li, M., Ma, B., & Wang, L. (1999). Finding similar regions in many strings. Conference Proceedings of the Annual ACM Symposium on Theory of Computing, 473–482. https://doi.org/10.1145/301250.301376
