GenAp: A distributed SQL interface for genomic data

This article is free to access.

Abstract

Background: The impressively low cost and improved quality of genome sequencing give researchers of genetic diseases, such as cancer, a powerful tool to better understand the underlying genetic mechanisms of those diseases and to treat them with effective targeted therapies. As a result, a number of projects today sequence the DNA of large patient populations, each of which produces at least hundreds of terabytes of data. The challenge now is to provide the produced data on demand to interested parties.

Results: In this paper, we show that the response to this challenge is a modified version of Spark SQL, a distributed SQL execution engine, that efficiently handles joins that use genomic intervals as keys. With this modification, Spark SQL serves such joins more than 50× faster than its existing brute-force approach and 8× faster than similar distributed implementations. Thus, Spark SQL can replace existing practices for retrieving genomic data and, as we show, allows users to reduce by an order of magnitude the number of lines of code that must be developed to query such data.
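To make the idea of a join keyed on genomic intervals concrete, the following is a minimal single-machine sketch in Python. It is not the paper's Spark SQL implementation. The tuple layout `(chromosome, start, end)` with closed coordinates, the function names, and the sample data are all assumptions made for illustration. It contrasts a brute-force join, which compares every pair, with a simple variant that sorts one side by start position and uses binary search to skip intervals that start too late to overlap.

```python
from bisect import bisect_right
from collections import defaultdict

# Assumed layout: (chromosome, start, end), with both ends inclusive.

def overlaps(a, b):
    """True when two intervals share a chromosome and at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def brute_force_join(left, right):
    """Compare every pair: O(len(left) * len(right)) overlap checks."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]

def indexed_join(left, right):
    """Illustrative pruning: sort `right` by start within each chromosome,
    then binary-search to discard intervals that start after `a` ends."""
    by_chrom = defaultdict(list)
    for b in right:
        by_chrom[b[0]].append(b)
    starts = {}
    for chrom, items in by_chrom.items():
        items.sort(key=lambda b: b[1])
        starts[chrom] = [b[1] for b in items]

    result = []
    for a in left:
        items = by_chrom.get(a[0], [])
        # Only intervals with start <= a's end can overlap a.
        hi = bisect_right(starts.get(a[0], []), a[2])
        for b in items[:hi]:
            if b[2] >= a[1]:  # b must also end at or after a's start
                result.append((a, b))
    return result

# Hypothetical sample data.
reads = [("chr1", 100, 150), ("chr1", 400, 450), ("chr2", 10, 60)]
genes = [("chr1", 120, 300), ("chr1", 440, 500), ("chr2", 100, 200)]

pairs = indexed_join(reads, genes)
```

Both functions return the same set of overlapping pairs. The indexed variant only prunes on start position, so it can still scan many candidates in the worst case; an interval tree or a distributed partitioning scheme would be needed to approach the kind of speedups the abstract reports.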

Cite

Kozanitis, C., & Patterson, D. A. (2016). GenAp: A distributed SQL interface for genomic data. BMC Bioinformatics, 17(1). https://doi.org/10.1186/s12859-016-0904-1
