Establishing a method of vector contamination identification in database sequences

This article is free to access.

Abstract

Motivation: Nucleotide sequence databases are invaluable tools for both the private and the academic research communities, supporting tasks that range from sequence retrieval to homology searching. The databases face several data-quality issues, such as sequencing artifacts and errors. We investigated a major source of these errors: the presence of vector-contaminated sequences.

Results: Using a panel of 180 vector polylinker sequences, we found 3029 vector-matching sequences (0.36%) in GenBank Release 95-96, with an average vector-matching length of 72 nucleotides. The number of vector-contaminated sequences has grown with the database; however, the percentage of contaminated sequences has remained approximately constant, averaging 0.28% from 1982 to 1996.

Availability: Access to the database of vector polylinker sequences via sequence similarity searching is available at http://seqsim.ncgr.org/vector/

Contact: gas@@@molinfo.com.
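The screening idea in the Results can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract says only that a panel of polylinker sequences was compared against GenBank by sequence similarity searching, and it does not name the search tool or its parameters. The sketch swaps in exact longest-common-substring matching on both strands. The function names, the `min_len` threshold of 20, and the toy polylinker and test sequences are all assumptions made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_shared_run(seq: str, vector: str) -> int:
    """Length of the longest exact substring shared by seq and vector."""
    best = 0
    prev = [0] * (len(vector) + 1)
    for a in seq:
        curr = [0] * (len(vector) + 1)
        for j, b in enumerate(vector, start=1):
            if a == b:
                # Extend the run that ended at the previous base of both strings.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def screen(sequences: dict, vectors: dict, min_len: int = 20) -> dict:
    """Map sequence id -> longest vector match, for matches >= min_len.

    min_len is a hypothetical cutoff, not a value taken from the paper.
    """
    hits = {}
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        best = 0
        for vec in vectors.values():
            vec = vec.upper()
            # Check both strands, since an insert can be cloned in either orientation.
            best = max(best,
                       longest_shared_run(seq, vec),
                       longest_shared_run(seq, reverse_complement(vec)))
        if best >= min_len:
            hits[seq_id] = best
    return hits


def percent_contaminated(n_hits: int, n_total: int) -> float:
    """Percentage of database sequences flagged as vector-matching."""
    return 100.0 * n_hits / n_total


# Toy data (assumed): a short polylinker-like string and three test records.
toy_vectors = {"toy_polylinker": "GAATTCGAGCTCGGTACCCGGGGATCC"}
toy_db = {
    "clean": "ATGAAACGCATTAGCACCACCATTACC",
    "contam": "TTTT" + toy_vectors["toy_polylinker"] + "AAAA",
    "contam_rc": "CC" + reverse_complement(toy_vectors["toy_polylinker"]) + "GG",
}
toy_hits = screen(toy_db, toy_vectors)
toy_percent = percent_contaminated(len(toy_hits), len(toy_db))
```

Exact matching keeps the sketch dependency-free, but it would miss matches that contain sequencing errors; a similarity search tolerates mismatches and gaps, which is why it suits real database screening better.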

Cite

Seluja, G. A., Farmer, A., McLeod, M., Harger, C., & Schad, P. A. (1999). Establishing a method of vector contamination identification in database sequences. Bioinformatics, 15(2), 106–110. https://doi.org/10.1093/bioinformatics/15.2.106
