Markov model recognition and classification of DNA/protein sequences within large text databases

Abstract

Motivation: Short sequence patterns frequently define regions of biological interest (binding sites, immune epitopes, primers, etc.), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe herein a system to accurately identify and classify sequence patterns from within large corpora using an n-gram Markov model (MM).

Results: As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures, such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter), was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million Medline abstracts and 9000 full-text articles from Journal of Virology. Performance was benchmarked by comparing the results with Journal of Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98 ± 2% precision/84% recall for primer identification and classification and 67 ± 6% precision/85% recall for peptide epitopes. We also found a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation.

© The Author 2005. Published by Oxford University Press. All rights reserved.
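
The general idea of classifying a text token with an n-gram Markov model can be illustrated with a minimal sketch. This is not the authors' implementation: the model order (bigram), the add-one smoothing, the toy training strings, and the function names `train`, `log_likelihood`, and `classify` are all assumptions made for illustration. One character-level model is trained per class, and a token is assigned to the class whose model gives it the highest per-character log-likelihood.

```python
import math
from collections import Counter

ALPHABET_SIZE = 26  # assumption: lowercase a-z only, used for add-one smoothing


def train(samples, order=1):
    """Count (context, next_char) transitions in lowercase training strings."""
    counts, contexts = Counter(), Counter()
    for s in samples:
        padded = "^" * order + s.lower()  # "^" marks the start of a token
        for i in range(order, len(padded)):
            ctx = padded[i - order:i]
            counts[(ctx, padded[i])] += 1
            contexts[ctx] += 1
    return counts, contexts, order


def log_likelihood(model, token):
    """Average per-character log-probability of token under the model."""
    counts, contexts, order = model
    if not token:
        return float("-inf")
    padded = "^" * order + token.lower()
    total = 0.0
    for i in range(order, len(padded)):
        ctx = padded[i - order:i]
        # Add-one smoothing so unseen transitions keep a small probability.
        p = (counts[(ctx, padded[i])] + 1) / (contexts[ctx] + ALPHABET_SIZE)
        total += math.log(p)
    return total / len(token)


def classify(models, token):
    """Return the label of the model that scores the token highest."""
    return max(models, key=lambda label: log_likelihood(models[label], token))


# Toy training data (hypothetical; a real system would use large corpora).
models = {
    "dna": train(["atgcgtacgttagc", "ggatccgaattcaagctt",
                  "ttgacagctagctcagtcc", "cgcgatatcgcgta"]),
    "english": train(["binding", "sequence", "primer", "protein",
                      "abstract", "the", "analysis", "target"]),
}

print(classify(models, "ACGTTGCA"))
print(classify(models, "binding"))
```

With so little training data the scores are only indicative. The sketch also hints at why small, regular alphabets such as unambiguous nucleotides are easier to separate from ordinary words than 1-letter peptide strings, whose alphabet overlaps heavily with English.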

Wren, J. D., Hildebrand, W. H., Chandrasekaran, S., & Melcher, U. (2005). Markov model recognition and classification of DNA/protein sequences within large text databases. Bioinformatics, 21(21), 4046–4053. https://doi.org/10.1093/bioinformatics/bti657
