Clustering near-identical sequences for fast homology search

Abstract

We present a new approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared only to the union-sequence representing each cluster; cluster members are reconstructed and aligned only if the union-sequence achieves a sufficiently high score. Using this approach in BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST, available from http://www.fsa-blast.org/. As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy. © Springer-Verlag Berlin Heidelberg 2006.
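The two-stage search described in the abstract can be illustrated with a minimal Python sketch. It is not the FSA-BLAST implementation: the names `Cluster`, `toy_score`, and `search` are hypothetical, edits are assumed to be substitutions only, and a simple ungapped match count stands in for BLAST alignment scoring.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A cluster stored as one representative sequence plus per-member edits.

    Assumption: each edit is a (position, residue) substitution; the paper's
    actual edit encoding may differ.
    """
    representative: str
    edits: dict[str, list[tuple[int, str]]] = field(default_factory=dict)

    def reconstruct(self, member_id: str) -> str:
        # Apply the member's substitutions to a copy of the representative.
        residues = list(self.representative)
        for position, residue in self.edits[member_id]:
            residues[position] = residue
        return "".join(residues)


def toy_score(query: str, subject: str) -> int:
    """Best ungapped match count of the query slid along the subject.

    A stand-in for a real alignment score, used only for illustration.
    """
    best = 0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(q == s for q, s in zip(query, window))
        best = max(best, matches)
    return best


def search(query: str, clusters: list[Cluster], threshold: int) -> list[tuple[str, int]]:
    """Score representatives first; expand only clusters that pass the threshold."""
    hits = []
    for cluster in clusters:
        if toy_score(query, cluster.representative) < threshold:
            continue  # skip the whole cluster without reconstructing members
        for member_id in cluster.edits:
            member = cluster.reconstruct(member_id)
            hits.append((member_id, toy_score(query, member)))
    return sorted(hits, key=lambda hit: -hit[1])


clusters = [
    Cluster("MKTAYIAKQR", {"seqA": [], "seqB": [(3, "G")]}),
    Cluster("GGGGGGGGGG", {"seqC": [(0, "A")]}),
]
results = search("TAYIA", clusters, threshold=4)
print(results)
```

In this toy run the second cluster is never expanded, because its representative scores below the threshold. That filtering step is what the abstract credits for the reduced search time.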

Cite

Cameron, M., Bernstein, Y., & Williams, H. E. (2006). Clustering near-identical sequences for fast homology search. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3909 LNBI, pp. 175–189). https://doi.org/10.1007/11732990_16
