Molecular sequences, like all experimental data, are subject to error. Many current DNA sequencing protocols have very significant error rates and often generate artefactual insertions and deletions of bases (indels) which corrupt the translation of sequences and compromise the detection of protein homologies. The impact of these errors on the utility of molecular sequence data is dependent on the analytic technique used to interpret the data. In the presence of frameshift errors, standard algorithms using six-frame translation can miss important homologies because only subfragments of the correct translation are available in any given frame. We present a new algorithm which can detect and correct frameshift errors in DNA sequences during comparison of translated sequences with protein sequences in the databases. This algorithm can recognize homologous proteins sharing 30% identity even in the presence of a 7% frameshift error rate. Our algorithm uses dynamic programming, producing a guaranteed optimal alignment in the presence of frameshifts, and has a sensitivity equivalent to Smith-Waterman. The computational efficiency of the algorithm is O(nm) where n and m are the sizes of two sequences being compared. The algorithm does not rely on prior knowledge or heuristic rules and performs significantly better than any previously reported method. © 1996, Oxford University Press.
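The general idea of a frameshift-tolerant dynamic program can be illustrated with a minimal sketch. This is not the authors' published recurrence: the function name `frameshift_local_score`, the identity-based scoring (`match`, `mismatch`), the linear `gap` penalty and the `frameshift` penalty are all assumptions chosen for illustration, and only the forward strand is considered. The sketch indexes the table by nucleotide position rather than codon position, so that an alignment may skip one or two stray bases and continue in a different reading frame, while the work stays proportional to n × m.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def frameshift_local_score(dna, protein, match=2, mismatch=-1, gap=2, frameshift=3):
    """Best local score of translated DNA against a protein, allowing frameshifts.

    All scoring parameters are illustrative assumptions, not values from the paper.
    Expects upper-case DNA (A, C, G, T) and one-letter amino acid codes.
    """
    n, m = len(dna), len(protein)
    # H[i][j]: best score of an alignment ending after dna[:i] and protein[:j].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = max(0, H[i][j - 1] - gap)             # restart, or residue with no codon
            score = max(score, H[i - 1][j] - frameshift)  # skip 1 stray base
            if i >= 2:
                score = max(score, H[i - 2][j] - frameshift)  # skip 2 stray bases
            if i >= 3:
                aa = CODON_TABLE.get(dna[i - 3:i], "X")   # "X" for ambiguous codons
                sub = match if aa == protein[j - 1] else mismatch
                score = max(score,
                            H[i - 3][j - 1] + sub,        # codon aligned to residue
                            H[i - 3][j] - gap)            # codon with no residue
            H[i][j] = score
            best = max(best, score)
    return best


if __name__ == "__main__":
    protein = "MKTAYIAKQR"
    dna = "ATGAAAACTGCTTATATTGCTAAACAACGT"   # encodes the protein above
    shifted = dna[:15] + "G" + dna[15:]      # one artefactual inserted base
    print(frameshift_local_score(dna, protein))
    print(frameshift_local_score(shifted, protein))
    # A prohibitive frameshift penalty confines every alignment to a single frame.
    print(frameshift_local_score(shifted, protein, frameshift=10**6))
```

In this simplified model an inserted base is absorbed by a single one-base skip, whereas a deleted base is handled by skipping the two remaining bases of the broken codon and leaving one residue unaligned. A realistic implementation would replace the identity scores with a substitution matrix, search both strands, and add a traceback to report the corrected translation.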
CITATION STYLE
Guan, X., & Uberbacher, E. C. (1996). Alignments of DNA and protein sequences containing frameshift errors. Bioinformatics, 12(1), 31–40. https://doi.org/10.1093/bioinformatics/12.1.31