Big omics data experience


Abstract

As personalized medicine becomes more integrated into healthcare, the rate at which human genomes are sequenced is rising quickly, with a concomitant acceleration in compute and storage requirements. To achieve the most effective solution for genomic workloads without re-architecting the industry-standard software, we performed a rigorous analysis of usage statistics, benchmarks, and available technologies to design a system for maximum throughput. We share our experience designing a system optimized for the "Genome Analysis Toolkit (GATK) Best Practices" whole-genome DNA and RNA pipeline, based on an evaluation of compute, workload, and I/O characteristics. The characteristics of genomic workloads differ vastly from those of traditional HPC workloads, requiring different configurations of the scheduler and the I/O subsystem to achieve reliability, performance, and scalability. By understanding how our researchers and clinicians work, we were able to employ techniques not only to speed up their workflow, yielding improved and repeatable performance, but also to make more efficient use of storage and compute resources.
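For orientation, the "GATK Best Practices" germline DNA pipeline named in the abstract proceeds, in broad strokes, through alignment, duplicate marking, base quality score recalibration, and variant calling. The sketch below is a minimal illustration of those stages, not the authors' system or configuration: all file names, paths, sample identifiers, and thread counts are placeholders, and a production deployment of the kind the paper describes would dispatch each stage through a cluster scheduler rather than run them serially on one node.

#!/usr/bin/env python3
"""Minimal sketch of the GATK Best Practices germline (DNA) stages.

Assumptions: bwa, samtools, and gatk (GATK4) are on PATH; reference,
known-sites, and FASTQ paths below are placeholders.
"""
import subprocess

REF = "GRCh38.fasta"       # reference genome (placeholder path)
KNOWN = "dbsnp.vcf.gz"     # known-sites VCF for recalibration (placeholder)
SAMPLE = "sample1"         # hypothetical sample identifier

def run(cmd):
    """Run one pipeline stage, failing loudly so a scheduler could retry it."""
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Align paired-end reads (bwa mem) and coordinate-sort (samtools).
run(["bash", "-c",
     f"bwa mem -t 16 {REF} {SAMPLE}_R1.fastq.gz {SAMPLE}_R2.fastq.gz"
     f" | samtools sort -@ 4 -o {SAMPLE}.sorted.bam -"])

# 2. Mark PCR/optical duplicates, then index the result.
run(["gatk", "MarkDuplicates",
     "-I", f"{SAMPLE}.sorted.bam",
     "-O", f"{SAMPLE}.dedup.bam",
     "-M", f"{SAMPLE}.dup_metrics.txt"])
run(["samtools", "index", f"{SAMPLE}.dedup.bam"])

# 3. Base quality score recalibration (BQSR): build the model, then apply it.
run(["gatk", "BaseRecalibrator",
     "-I", f"{SAMPLE}.dedup.bam", "-R", REF,
     "--known-sites", KNOWN,
     "-O", f"{SAMPLE}.recal.table"])
run(["gatk", "ApplyBQSR",
     "-I", f"{SAMPLE}.dedup.bam", "-R", REF,
     "--bqsr-recal-file", f"{SAMPLE}.recal.table",
     "-O", f"{SAMPLE}.recal.bam"])

# 4. Call germline variants per sample in GVCF mode.
run(["gatk", "HaplotypeCaller",
     "-R", REF, "-I", f"{SAMPLE}.recal.bam",
     "-O", f"{SAMPLE}.g.vcf.gz", "-ERC", "GVCF"])

Each stage has a distinct compute and I/O profile (alignment is CPU-bound and streaming, duplicate marking and sorting are I/O- and memory-heavy, variant calling is long-running per region), which is why the paper's point about tuning the scheduler and I/O subsystem separately for genomic workloads matters.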

Citation (APA)

Kovatch, P., Costa, A., Giles, Z., Fluder, E., Cho, H. M., & Mazurkova, S. (2015). Big omics data experience. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM. https://doi.org/10.1145/2807591.2807595
