Establishing a method of vector contamination identification in database sequences

Gustavo A. Seluja; Andrew Farmer; Mia McLeod; Carol Harger; Peter A. Schad

Journal Article, Open Access

Bioinformatics (1999) 15(2) 106-110

DOI: 10.1093/bioinformatics/15.2.106

Abstract

Motivation: The nucleotide sequence databases are invaluable tools for both the private and the academic research communities, supporting tasks from sequence retrieval to homology searching. The databases face several data-quality issues, such as sequencing artifacts and errors. We investigated a major source of these errors: the presence of vector-contaminated sequences.

Results: Using a panel of 180 vector polylinker sequences, we found 3029 vector-matching sequences (0.36%) in GenBank Release 95-96, with an average vector-matching length of 72 nucleotides. The number of vector-contaminated sequences has grown with the database; however, the percent contamination has remained approximately constant, averaging 0.28% from 1982 to 1996.

Availability: Access to the database of vector polylinker sequences via sequence similarity searching is available at http://seqsim.ncgr.org/vector/

Contact: gas@molinfo.com
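As a rough illustration of the screening idea described in the abstract (comparing database sequences against a panel of vector polylinker sequences and reporting the fraction that match), the following minimal Python sketch flags records that share an exact k-mer with any panel entry. This is not the authors' method: the abstract refers to sequence similarity searching, and exact k-mer sharing is substituted here only as a simplification. The function names, the value of `k`, and the toy sequences are all hypothetical and are not taken from the paper or its vector panel.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen(records: Dict[str, str], vectors: Dict[str, str],
           k: int = 12) -> List[Tuple[str, str]]:
    """Flag (record_id, vector_id) pairs sharing at least one exact k-mer.

    Assumption: exact k-mer sharing stands in for a real similarity search.
    """
    vector_index = {vid: kmers(vseq, k) for vid, vseq in vectors.items()}
    hits = []
    for rid, rseq in records.items():
        record_kmers = kmers(rseq, k)
        for vid, vector_kmers in vector_index.items():
            if record_kmers & vector_kmers:
                hits.append((rid, vid))
    return hits


def percent_contaminated(hits: List[Tuple[str, str]],
                         total_records: int) -> float:
    """Percentage of records flagged against at least one vector."""
    flagged = {rid for rid, _ in hits}
    return 100.0 * len(flagged) / total_records


# Toy data (hypothetical, for illustration only).
toy_vectors = {"toy_polylinker": "GAATTCGAGCTCGGTACCCGGGGATCC"}
toy_records = {
    "rec1": "ATGCATGCGAGCTCGGTACCCGGTTTTAAAA",  # embeds part of the toy polylinker
    "rec2": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
}

toy_hits = screen(toy_records, toy_vectors)
print(toy_hits, percent_contaminated(toy_hits, len(toy_records)))
```

A real screen would need to tolerate mismatches and report the matching length per hit, which exact k-mer lookup does not do; the sketch only shows how a per-record flag turns into a database-wide contamination percentage.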

Citation (APA)

Seluja, G. A., Farmer, A., McLeod, M., Harger, C., & Schad, P. A. (1999). Establishing a method of vector contamination identification in database sequences. Bioinformatics, 15(2), 106–110. https://doi.org/10.1093/bioinformatics/15.2.106
