Abstract
We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust clusters 113,674 complete bacterial genome sequences from RefSeq (455 GB in FASTA format) in under 6 min, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in only 34 min, on a 128-core workstation. Our results further identify 1,269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
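To make "sketch-based distance estimation" concrete, the following is a minimal Python sketch of the general MinHash idea named in the paper's title. It is not RabbitTClust's implementation. The function names, the BLAKE2b hash, the defaults `k=21` and `size=1000`, and the Mash-style distance formula are assumptions chosen for illustration, and reverse-complement (canonical) k-mers are ignored for simplicity.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b, an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k=21):
    """Mash-style distance D = -(1/k) * ln(2j / (1 + j)), capped at 1 when j = 0."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    a = "ACGTACGGTC"
    b = "ACGTACGGTA"
    j = jaccard_estimate(minhash_sketch(a, k=3), minhash_sketch(b, k=3))
    print(f"Jaccard estimate: {j:.3f}, distance: {mash_distance(j, k=3):.3f}")
```

Each genome is reduced to a fixed-size list of hashes, so a pairwise comparison costs time proportional to the sketch size rather than to the genome length. This is the dimensionality reduction that makes clustering at this scale feasible. A clustering step would then operate on the resulting pairwise distances.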
Cite
Xu, X., Yin, Z., Yan, L., Zhang, H., Xu, B., Wei, Y., … Liu, W. (2023). RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches. Genome Biology, 24(1). https://doi.org/10.1186/s13059-023-02961-6