A fundamental requirement for many applications in genomics is the sequencing of genetic material (DNA/RNA). Different sequencing technologies exist, but all aim to accurately reproduce the sequence of nucleotides (the individual units of DNA and RNA) in the genetic material under investigation. The result of such efforts is a text file containing the individual fragments of genetic material, termed "reads", represented as strings of letters (A, C, G, and T/U).

The amount of data in one of these read files depends on how much genetic material was present and how long the sequencing device was operated. Read depth (coverage) is a measure of the volume of genetic data contained in a read file. For example, coverage of 5x indicates that, on average, each nucleotide in the original genetic material is represented five times in the read file.

Many of the computational methods employed in genomics are affected by coverage; counterintuitively, more is not always better. Because sequencing devices are not perfect, reads inevitably contain errors. Higher coverage therefore increases the absolute number of errors, which can make them look like alternative sequences. Furthermore, for some applications, too much coverage can degrade computational performance through increased runtimes or memory usage.

We present Rasusa, a software program that randomly subsamples a given read file to a specified coverage. Rasusa is written in the Rust programming language and is much faster than current solutions for subsampling read files. In addition, it provides an ergonomic command-line interface and allows users to specify either a desired coverage or a target number of nucleotides.

Statement of need

Read subsampling is a useful mechanism for creating artificial datasets, allowing exploration of a computational method's performance as data becomes more scarce. In addition, the coverage of a sample can have a significant impact on a variety of computational methods, such as RNA-seq analysis (Baccarella et al., 2018), taxonomic classification (Gweon et al., 2019), antimicrobial resistance detection (Gweon et al., 2019), and genome assembly (Maio et al., 2019), to name a few.
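
The definition of coverage given above (total sequenced nucleotides divided by the size of the original genetic material) and the idea of random subsampling to a target coverage can be illustrated with a minimal Python sketch. This is not Rasusa's Rust implementation or its interface: the function names `coverage` and `subsample`, their parameters, and the shuffle-then-accumulate strategy are assumptions made purely for illustration, and reads are represented only by their lengths.

```python
import random


def coverage(read_lengths, genome_size):
    """Average read depth: total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


def subsample(read_lengths, genome_size, target_coverage, seed=None):
    """Pick reads at random until the target number of bases is reached.

    Hypothetical helper for illustration only. Returns the indices of the
    reads that were kept. If the input has less data than requested, every
    read is kept.
    """
    target_bases = target_coverage * genome_size
    indices = list(range(len(read_lengths)))
    # A seeded generator makes the random selection reproducible.
    random.Random(seed).shuffle(indices)

    kept, total_bases = [], 0
    for i in indices:
        if total_bases >= target_bases:
            break
        kept.append(i)
        total_bases += read_lengths[i]
    return kept


# Ten reads of 100 bases over a 200-base genome give 5x coverage.
lengths = [100] * 10
print(coverage(lengths, 200))

# Subsample to roughly 2x and report the coverage of the retained reads.
kept = subsample(lengths, 200, target_coverage=2, seed=1)
print(coverage([lengths[i] for i in kept], 200))
```

Because whole reads are kept or discarded, the coverage of the subsample will generally land slightly above the target rather than exactly on it when read lengths vary.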
Hall, M. (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941. https://doi.org/10.21105/joss.03941