High performance merging of massive data from genome-wide association studies

0Citations
Citations of this article
1Readers
Mendeley users who have this article in their library.
Get full text

Abstract

The traditional data processing methods working on single computer show less scalability and efficiency for performing ordered full-outer-joining, on merging large number of individual Genome-Wide Associations Studies (GWAS) data. Although the emerging of big data platforms such as Hadoop and Spark shed lights on this problem, the inefficiency of keeping data in total-sorted order as well as the workload imbalance problem limit their performance. In this study, we designed and compared three new methodologies based on MapReduce, HBase and Spark respectively, to merge hundreds of individuals VCF files on their Single Nucleotide Polymorphism (SNP) location into a single TPED file. Our methodologies overcame the limitations stated above and considerably improved the performance with good scalability on input size and computing resources.

Cite

CITATION STYLE

APA

Sun, X., Wang, F., & Qin, Z. (2017). High performance merging of massive data from genome-wide association studies. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10494 LNCS, pp. 36–40). Springer Verlag. https://doi.org/10.1007/978-3-319-67186-4_4

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free