Evaluating genomic big data operations on SciDB and Spark

Simone Cattani; Stefano Ceri; Abdulrahman Kaitoua; Pietro Pinoli

Conference Proceedings (Open Access)

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017) 10360 LNCS 482-493

DOI: 10.1007/978-3-319-60131-1_34

Abstract

We are developing a new, holistic data management system for genomics that provides high-level abstractions for querying large genomic datasets. We designed the system to rely on existing data management engines for low-level data access. This design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). The trade-offs are not obvious: scientific databases are expected to outperform generic platforms when they use features embedded in their specialized design, whereas generic platforms are expected to outperform scientific databases on general-purpose operations. In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. As a benchmark, we use four typical genomic operations, stemming from the concrete requirements of our project and encoded in both SciDB and Spark; we discuss their common aspects and differences, focusing on how genomic regions and operations can be expressed using SciDB arrays. We comparatively evaluate the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.

Cite

Cattani, S., Ceri, S., Kaitoua, A., & Pinoli, P. (2017). Evaluating genomic big data operations on SciDB and Spark. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10360 LNCS, pp. 482–493). Springer Verlag. https://doi.org/10.1007/978-3-319-60131-1_34
