We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust clusters 113,674 complete bacterial genome sequences from RefSeq (455 GB in FASTA format) in less than 6 min, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in only 34 min, on a 128-core workstation. Our results further identify 1,269 redundant genomes with identical nucleotide content in the RefSeq bacterial genomes database.
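To make "sketch-based distance estimation" concrete, the following is a minimal Python sketch of a bottom-k MinHash estimate of the Jaccard index between two sequences, converted to a distance with the Mash formula D = -(1/k) ln(2J / (1 + J)). It is an illustration under stated assumptions, not RabbitTClust's implementation: the function names, the BLAKE2b hash, the default k-mer length and sketch size, and the use of the Mash distance are all choices made for this example, and canonical (reverse-complement) k-mers are ignored for brevity.

```python
import hashlib
import math


def hash64(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq: str, k: int = 21, size: int = 1000) -> list:
    """Bottom-k MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {hash64(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:size]


def jaccard_estimate(a: list, b: list, size: int) -> float:
    """Estimate the Jaccard index from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    set_a, set_b = set(a), set(b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2J / (1 + J)), clamped to [0, 1]."""
    if jaccard <= 0.0:
        return 1.0
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, d))


if __name__ == "__main__":
    # Toy sequences and a small k, chosen only to keep the example readable.
    k, size = 4, 16
    s1 = sketch("ACGTACGGTCAGTTCAGGCA", k, size)
    s2 = sketch("ACGTACGGTCAGTTCAGGCT", k, size)
    j = jaccard_estimate(s1, s2, size)
    print("Jaccard estimate:", j, "distance:", mash_distance(j, k))
```

Because each genome is reduced to a fixed-size list of hashes, every pairwise comparison costs time proportional to the sketch size rather than the genome length, which is the kind of dimensionality reduction that makes clustering at this scale tractable.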
CITATION STYLE
Xu, X., Yin, Z., Yan, L., Zhang, H., Xu, B., Wei, Y., … Liu, W. (2023). RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches. Genome Biology, 24(1). https://doi.org/10.1186/s13059-023-02961-6