AGC: compact representation of assembled genomes with fast queries and updates

3Citations
Citations of this article
11Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Motivation: High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies of hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets. Results: Here, we show how to reduce the size of the sequenced genomes by 2-3 orders of magnitude. Our tool compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or its part) in a fraction of a second and easily append new samples to the compressed collections. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly reduced cost and improved accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundation tool to store, distribute and access pangenome data.

Cite

CITATION STYLE

APA

Deorowicz, S., Danek, A., & Li, H. (2023). AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics, 39(3). https://doi.org/10.1093/bioinformatics/btad097

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free