High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model

Dariusz Mrozek; Marek Suwała; Bożena Małysiak-Mrozek

Journal Article, Open Access

Knowledge and Information Systems (2019) 60(1) 145-178

DOI: 10.1007/s10115-018-1245-3

Abstract

Efficient computational solutions for identifying protein functions or finding structural homologs of proteins are gaining importance in the era of structural genomics and in the face of growing volumes of biological data. Structural alignments, which underlie both processes, are time-consuming, especially when performed for large collections of 3D protein structures. Fortunately, structural alignments can be carried out on well-separable and independent subsets of the whole macromolecular data repository, which fits the MapReduce processing paradigm of bringing computations to data. In this paper, we show how protein function identification and the search for structural homologs can be efficiently accelerated with the MapReduce procedure executed on a Hadoop cluster established in a virtualized compute environment or a private cloud. For this purpose, we propose a Map-only processing pattern of the MapReduce procedure, which is formally defined in this paper. The solution combines the advantages of performing computations in small virtualized compute environments with large-scale computations in public clouds, thus making it possible to perform structural alignments for a number of usage scenarios, including comparison of pairs of 3D protein structures during evaluation of predicted protein models, one-to-many comparisons when identifying possible functions of a given structure, and all-to-all alignments when investigating the divergence between known protein structures and classifying proteins by their fold. We also present results of performance tests conducted while scaling up the nodes of the Hadoop cluster and increasing the degree of parallelism in order to improve the efficiency of the computations.
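The Map-only idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use the Hadoop API: the function names (`toy_similarity`, `one_to_many`, `all_to_all`, `split`, `map_task`, `run_map_only`) are hypothetical, and `toy_similarity` is only a placeholder for a real structural alignment algorithm. The sketch shows the property the pattern relies on: each input split is processed independently, and the output of each map task is final, with no shuffle or reduce phase. In Hadoop itself, a map-only job is typically obtained by setting the number of reduce tasks to zero.

```python
from itertools import combinations

# Hypothetical stand-in for a structural alignment routine. It scores two toy
# "structures" (lists of 3D points) by the ratio of their lengths only; a real
# job would call an actual alignment algorithm here.
def toy_similarity(a, b):
    longer = max(len(a), len(b))
    return min(len(a), len(b)) / longer if longer else 1.0

def one_to_many(query_id, ids):
    """Tasks for comparing one query structure against all others."""
    return [(query_id, other) for other in ids if other != query_id]

def all_to_all(ids):
    """Tasks for comparing every unordered pair of structures once."""
    return list(combinations(sorted(ids), 2))

def split(tasks, n_splits):
    """Partition the task list into independent input splits (round-robin)."""
    return [tasks[i::n_splits] for i in range(n_splits)]

def map_task(input_split, structures):
    """One map task: score every pair in its split and emit the records.
    Nothing is grouped by key afterwards, so no reduce step is needed."""
    return [(q, t, toy_similarity(structures[q], structures[t]))
            for q, t in input_split]

def run_map_only(tasks, structures, n_splits):
    """Run all map tasks; each split's output is already final.
    The concatenation below only mimics reading the per-task output files."""
    outputs = [map_task(s, structures) for s in split(tasks, n_splits)]
    return [record for part in outputs for record in part]

if __name__ == "__main__":
    # Toy repository: identifiers mapped to lists of 3D coordinates.
    structures = {
        "A": [(0.0, 0.0, 0.0)] * 4,
        "B": [(0.0, 0.0, 0.0)] * 2,
        "C": [(0.0, 0.0, 0.0)] * 4,
    }
    tasks = one_to_many("A", list(structures))
    for record in run_map_only(tasks, structures, n_splits=2):
        print(record)
```

Because the map tasks share no state, the number of splits can be raised to increase the degree of parallelism without changing the results, which is the property the performance tests mentioned in the abstract depend on.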

Cite

Mrozek, D., Suwała, M., & Małysiak-Mrozek, B. (2019). High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model. Knowledge and Information Systems, 60(1), 145–178. https://doi.org/10.1007/s10115-018-1245-3
