Motivation: Short sequence patterns frequently define regions of biological interest (binding sites, immune epitopes, primers, etc.), yet a large fraction of this information exists only within the scientific literature and is thus difficult to locate via conventional means (e.g. keyword queries or manual searches). We describe herein a system to accurately identify and classify sequence patterns from within large corpora using an n-gram Markov model (MM).

Results: As expected, on test sets we found that identification of sequences with limited alphabets and/or regular structures, such as nucleic acids (non-ambiguous) and peptide abbreviations (3-letter), was highly accurate, whereas classification of symbolic (1-letter) peptide strings with more complex alphabets was more problematic. The MM was used to analyze two very large, sequence-containing corpora: over 7.75 million Medline abstracts and 9000 full-text articles from Journal of Virology. Performance was benchmarked by comparing the results with Journal of Virology entries in two existing manually curated databases: VirOligo and the HLA Ligand Database. Performance estimates were 98 ± 2% precision/84% recall for primer identification and classification and 67 ± 6% precision/85% recall for peptide epitopes. We also find a dramatic difference between the amounts of sequence-related data reported in abstracts versus full text. Our results suggest that automated extraction and classification of sequence elements is a promising, low-cost means of sequence database curation and annotation.

© The Author 2005. Published by Oxford University Press. All rights reserved.
CITATION
Wren, J. D., Hildebrand, W. H., Chandrasekaran, S., & Melcher, U. (2005). Markov model recognition and classification of DNA/protein sequences within large text databases. Bioinformatics, 21(21), 4046–4053. https://doi.org/10.1093/bioinformatics/bti657