RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

This article is free to access.

Abstract

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust clusters 113,674 complete bacterial genome sequences from RefSeq (455 GB in FASTA format) in less than 6 min and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in only 34 min on a 128-core workstation. Our results further identify 1,269 redundant genomes with identical nucleotide content in the RefSeq bacterial genomes database.
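
The abstract does not describe the estimator itself, so the following is only a minimal sketch of how MinHash sketches can approximate the distance between two genomes. It assumes a bottom-s sketch (keep the s smallest k-mer hashes) and the Mash distance formula; the hash function, the defaults `k=21` and `size=1000`, and all function names are illustrative choices, not RabbitTClust's actual implementation. Reverse complements are ignored for brevity.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every overlapping k-mer of the sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (blake2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(a, b, size=1000):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    set_a, set_b = set(a), set(b)
    union = sorted(set_a | set_b)[:size]
    if not union:
        return 0.0
    shared = sum(1 for h in union if h in set_a and h in set_b)
    return shared / len(union)


def mash_distance(j, k=21):
    """Convert a Jaccard estimate to a Mash distance (assumed formula)."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)
```

Each genome is reduced to a fixed-size list of integers regardless of its length, which is what makes pairwise comparison across very large collections tractable; the resulting distances could then feed any clustering step.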

Cite

Xu, X., Yin, Z., Yan, L., Zhang, H., Xu, B., Wei, Y., … Liu, W. (2023). RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches. Genome Biology, 24(1). https://doi.org/10.1186/s13059-023-02961-6
