As personalized medicine becomes more integrated into healthcare, the rate at which human genomes are sequenced is rising rapidly, with a concomitant growth in compute and storage requirements. To deliver an effective solution for genomic workloads without re-architecting the industry-standard software, we performed a rigorous analysis of usage statistics, benchmarks and available technologies to design a system for maximum throughput. We share our experiences designing a system optimized for the "Genome Analysis ToolKit (GATK) Best Practices" whole-genome DNA and RNA pipeline, based on an evaluation of compute, workload and I/O characteristics. The characteristics of genomic workloads differ markedly from those of traditional HPC workloads, requiring different configurations of the scheduler and the I/O subsystem to achieve reliability, performance and scalability. By understanding how our researchers and clinicians work, we were able to employ techniques not only to speed up their workflow, yielding improved and repeatable performance, but also to make more efficient use of storage and compute resources.
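For orientation, the sketch below outlines the stages of the GATK Best Practices whole-genome DNA pipeline the abstract refers to. It is a minimal illustration, not the authors' system: all file names and thread counts are hypothetical placeholders, and the invocations assume GATK 3-era command lines with bwa, samtools, Picard and GenomeAnalysisTK.jar available on the system.

```python
# Minimal sketch of the GATK Best Practices whole-genome DNA stages.
# All inputs (ref.fasta, dbsnp.vcf, reads_*.fastq) are hypothetical.
import subprocess

REF = "ref.fasta"    # hypothetical reference genome
DBSNP = "dbsnp.vcf"  # hypothetical known-sites file for recalibration

def run(cmd, stdout=None):
    """Run one pipeline stage, failing fast on a nonzero exit code."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, stdout=stdout)

# 1. Align paired-end reads with BWA-MEM, then coordinate-sort.
with open("aligned.sam", "wb") as sam:
    run(["bwa", "mem", "-t", "16", REF,
         "reads_1.fastq", "reads_2.fastq"], stdout=sam)
run(["samtools", "sort", "-@", "16", "-o", "sorted.bam", "aligned.sam"])

# 2. Mark PCR/optical duplicates with Picard.
run(["java", "-jar", "picard.jar", "MarkDuplicates",
     "I=sorted.bam", "O=dedup.bam", "M=dup_metrics.txt"])
run(["samtools", "index", "dedup.bam"])

# 3. Base quality score recalibration (BQSR).
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "BaseRecalibrator",
     "-R", REF, "-I", "dedup.bam", "-knownSites", DBSNP,
     "-o", "recal.table"])
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "PrintReads",
     "-R", REF, "-I", "dedup.bam", "-BQSR", "recal.table",
     "-o", "recal.bam"])

# 4. Call variants per sample with HaplotypeCaller.
run(["java", "-jar", "GenomeAnalysisTK.jar", "-T", "HaplotypeCaller",
     "-R", REF, "-I", "recal.bam", "-o", "variants.vcf"])
```

Each stage reads the previous stage's output from shared storage, which is why the paper's I/O-subsystem and scheduler tuning matters: many such per-sample pipelines run concurrently on a cluster.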
CITATION
Kovatch, P., Costa, A., Giles, Z., Fluder, E., Cho, H. M., & Mazurkova, S. (2015). Big omics data experience. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15), 15-20 November 2015. IEEE Computer Society. https://doi.org/10.1145/2807591.2807595