Using Apache Spark on genome assembly for scalable overlap-graph reduction

3Citations
Citations of this article
15Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

BACKGROUND: De novo genome assembly is a technique that builds the genome of a specimen using overlaps of genomic fragments without additional work with reference sequence. Sequence fragments (called reads) are assembled as contigs and scaffolds by the overlaps. The quality of the de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, existing sequencing techniques have been proposed, for example, high-throughput next-generation sequencing and long-reads-producing third-generation sequencing. However, these techniques require a large amounts of computer memory when very huge-size overlap graphs are resolved. Also, it is challenging for parallel computation. RESULTS: To address the limitations, we propose an innovative algorithmic approach, called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction algorithms by Apache Spark. The SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the number of edges on enormous graphing paths by adapting scalable features of graph processing libraries provided by Apache Spark, GraphX and GraphFrames. CONCLUSIONS: We shared the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA . We evaluated SORA with the human genome samples. First, it processed a nearly one billion edge graph on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA achieved the linear-scaling simulations for the increased computing instances.

Cite

CITATION STYLE

APA

Paul, A. J., Lawrence, D., Song, M., Lim, S. H., Pan, C., & Ahn, T. H. (2019). Using Apache Spark on genome assembly for scalable overlap-graph reduction. Human Genomics, 13, 48. https://doi.org/10.1186/s40246-019-0227-1

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free