Motivation: The nucleotide sequence databases are invaluable tools for both the private and the academic research communities, with uses ranging from sequence retrieval to homology searching. The databases face several data-quality issues, such as sequencing artifacts and errors. We investigated a major source of these errors, i.e. the presence of vector-contaminated sequences.

Results: Using a panel of 180 vector polylinker sequences, we found 3029 vector-matching sequences (0.36%) in GenBank Release 95-96, with an average vector-matching length of 72 nucleotides. The number of vector-contaminated sequences has grown with the database; however, the percent contamination has remained approximately constant, averaging 0.28% from 1982 to 1996.

Availability: The database of vector polylinker sequences can be accessed via sequence similarity searching at http://seqsim.ncgr.org/vector/

Contact: gas@molinfo.com
CITATION STYLE
Seluja, G. A., Farmer, A., McLeod, M., Harger, C., & Schad, P. A. (1999). Establishing a method of vector contamination identification in database sequences. Bioinformatics, 15(2), 106–110. https://doi.org/10.1093/bioinformatics/15.2.106