Abstract
Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. We solve three main open questions in this area. Assume that we are given n DNA sequences s1, ..., sn. The Consensus Patterns problem, which has been widely studied in bioinformatics research, asks in its simplest form for a region of length L in each si and a median string s of length L such that the total Hamming distance from s to these regions is minimized. We show that the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We also give a PTAS for the problem under the original measure of [26, 16, 12, 25]. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important star alignment problem that allows at most a constant number of gaps, each of arbitrary length, in each sequence. The Closest String problem asks for the smallest d and a string s that is within Hamming distance d of each si. The problem is NP-hard. [3] gives a polynomial time algorithm for constant d. For super-logarithmic d, [2, 9] give efficient approximation algorithms using linear programming relaxation techniques. The best polynomial time approximation has ratio 4/3 for all d, given by [18] ([9] also independently claimed the 4/3 ratio, but only for super-logarithmic d). We settle the problem with a PTAS. We then give the first nontrivial better-than-2 approximation, with ratio 2 − 2/(2|Σ| + 1), for the more elusive Closest Substring problem: find a string s of length L such that, for each i, s is within Hamming distance d of some length-L substring of si.
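To make the three objectives concrete, the following Python sketch evaluates each of them for a given candidate string and solves Closest String by exhaustive search on a toy instance. It is an illustration of the problem definitions only, not the approximation schemes of the paper; all function names and the default alphabet "ACGT" are assumptions introduced here, and the exhaustive search takes time exponential in L.

```python
from itertools import product
from typing import Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_window_distance(s: str, t: str) -> int:
    """Smallest Hamming distance from s to any substring of t of length len(s)."""
    L = len(s)
    return min(hamming(s, t[i:i + L]) for i in range(len(t) - L + 1))


def consensus_patterns_cost(s: str, seqs: Sequence[str]) -> int:
    """Consensus Patterns objective: total distance from s to the best region of each sequence."""
    return sum(best_window_distance(s, t) for t in seqs)


def closest_substring_radius(s: str, seqs: Sequence[str]) -> int:
    """Closest Substring objective: worst-case distance from s to the best region of each sequence."""
    return max(best_window_distance(s, t) for t in seqs)


def closest_string_radius(s: str, seqs: Sequence[str]) -> int:
    """Closest String objective: largest distance from s to any (equal-length) input."""
    return max(hamming(s, t) for t in seqs)


def brute_force_closest_string(seqs: Sequence[str], alphabet: str = "ACGT") -> Tuple[str, int]:
    """Try every string over the alphabet (hypothetical helper; exponential in L)."""
    L = len(seqs[0])
    candidates = ("".join(c) for c in product(alphabet, repeat=L))
    best = min(candidates, key=lambda s: closest_string_radius(s, seqs))
    return best, closest_string_radius(best, seqs)


if __name__ == "__main__":
    seqs = ["ACGT", "TTCA", "GGAA"]
    print(consensus_patterns_cost("CG", seqs))   # sum objective for candidate "CG"
    print(closest_substring_radius("CG", seqs))  # max objective for candidate "CG"
    print(brute_force_closest_string(["ACGT", "ACGA", "TCGT"]))
```

The sum-versus-max distinction in the code is the difference between Consensus Patterns and Closest Substring: both choose one length-L region per sequence, but the former minimizes the total distance and the latter the largest distance.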
Cite
Li, M., Ma, B., & Wang, L. (1999). Finding similar regions in many strings. Conference Proceedings of the Annual ACM Symposium on Theory of Computing, 473–482. https://doi.org/10.1145/301250.301376