GenAp: A distributed SQL interface for genomic data

Christos Kozanitis; David A. Patterson

Journal Article, Open Access

BMC Bioinformatics (2016) 17(1)

DOI: 10.1186/s12859-016-0904-1

13 Citations

59 Readers

Abstract

Background: The impressively low cost and improved quality of genome sequencing give researchers of genetic diseases, such as cancer, a powerful tool for understanding the underlying genetic mechanisms of those diseases and treating them with effective targeted therapies. As a result, a number of projects today sequence the DNA of large patient populations, and each project produces at least hundreds of terabytes of data. The challenge now is to provide that data on demand to interested parties.

Results: In this paper, we show that this challenge can be met with a modified version of Spark SQL, a distributed SQL execution engine, that efficiently handles joins that use genomic intervals as keys. With this modification, Spark SQL serves such joins more than 50× faster than its existing brute-force approach and 8× faster than similar distributed implementations. Spark SQL can therefore replace existing practices for retrieving genomic data and, as we show, reduce by an order of magnitude the number of lines of code that users must write to query such data.
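To make "joins that use genomic intervals as keys" concrete, the following is a minimal single-machine sketch in plain Python. It is not GenAp's or Spark SQL's implementation, which the abstract does not describe. The interval layout `(chrom, start, end)` with closed ends, and the function names `overlaps`, `brute_force_join` and `sorted_join`, are assumptions made for illustration. The first function pairs every left interval with every right interval, as a brute-force join would; the second sorts the right side by start position so that candidates which start after the left interval ends are never examined.

```python
import bisect
from collections import defaultdict


def overlaps(a, b):
    """True if two (chrom, start, end) intervals lie on the same chromosome
    and share at least one position (closed ends are an assumption here)."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def brute_force_join(left, right):
    """Compare every left interval with every right interval."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]


def sorted_join(left, right):
    """Group the right side by chromosome and sort it by start position,
    then use binary search to skip intervals that start after a's end."""
    by_chrom = defaultdict(list)
    for b in right:
        by_chrom[b[0]].append(b)

    starts = {}
    for chrom, items in by_chrom.items():
        items.sort(key=lambda b: b[1])
        starts[chrom] = [b[1] for b in items]

    out = []
    for a in left:
        items = by_chrom.get(a[0], [])
        # Only right intervals starting at or before a's end can overlap a.
        hi = bisect.bisect_right(starts.get(a[0], []), a[2])
        out.extend((a, b) for b in items[:hi] if b[2] >= a[1])
    return out
```

Both functions return the same pairs; they differ only in how many comparisons they perform. A distributed engine would additionally need to partition the intervals across machines, which this sketch leaves out.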

Citation (APA)

Kozanitis, C., & Patterson, D. A. (2016). GenAp: A distributed SQL interface for genomic data. BMC Bioinformatics, 17(1). https://doi.org/10.1186/s12859-016-0904-1
