High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model

This article is free to access.

Abstract

Efficient computational solutions for identifying protein functions or finding structural homologs of proteins are gaining importance in the era of structural genomics and in the face of growing volumes of biological data. Structural alignments, which underlie both processes, are time-consuming, especially when performed for large collections of 3D protein structures. Fortunately, structural alignments can be carried out on well-separable, independent subsets of the whole macromolecular data repository, which fits well with the MapReduce processing paradigm of bringing computations to data. In this paper, we show how protein function identification and the search for structural homologs can be accelerated with the MapReduce procedure executed on a Hadoop cluster established in a virtualized compute environment or a private cloud. For this purpose, we propose the Map-only processing pattern of the MapReduce procedure, which is formally defined in this paper. The solution combines the advantages of performing computations in small virtualized compute environments with those of large-scale computations in public clouds, and thus supports structural alignments in a number of usage scenarios: comparison of pairs of 3D protein structures during evaluation of predicted protein models, one-to-many comparisons while identifying possible functions of a given structure, and all-to-all alignments while investigating the divergence between known protein structures and classifying proteins by their fold. We also present the results of performance tests in which we scaled up the nodes of the Hadoop cluster and increased the degree of parallelism to improve the efficiency of the computations.
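To make the idea of a Map-only pattern concrete, here is a minimal sketch in plain Python of a one-to-many comparison. It is not the authors' implementation and not their formal definition. It only illustrates the general shape: the repository is divided into independent splits, each split is handled by a map function, and the map outputs are written out directly with no shuffle or reduce phase (in Hadoop this corresponds to a job configured with zero reduce tasks). Every name here (`toy_score`, `map_split`, `make_splits`, `run_map_only`) is hypothetical, and `toy_score` is a placeholder distance-based score, not a real structural alignment algorithm.

```python
from typing import Dict, Iterator, List, Tuple

# Toy stand-in for a protein structure: a list of 3D points.
Structure = List[Tuple[float, float, float]]


def toy_score(query: Structure, target: Structure) -> float:
    """Placeholder similarity in (0, 1]; NOT a real structural alignment."""
    n = min(len(query), len(target))
    if n == 0:
        return 0.0
    total = 0.0
    # zip stops at the shorter structure, so exactly n pairs are compared.
    for (qx, qy, qz), (tx, ty, tz) in zip(query, target):
        total += ((qx - tx) ** 2 + (qy - ty) ** 2 + (qz - tz) ** 2) ** 0.5
    return 1.0 / (1.0 + total / n)


def map_split(query: Structure,
              split: Dict[str, Structure]) -> Iterator[Tuple[str, float]]:
    """Map function: emit one (target_id, score) pair per structure in the split."""
    for target_id, target in split.items():
        yield target_id, toy_score(query, target)


def make_splits(repository: Dict[str, Structure],
                n_splits: int) -> List[Dict[str, Structure]]:
    """Divide the repository into independent subsets (round-robin by sorted id)."""
    splits: List[Dict[str, Structure]] = [{} for _ in range(n_splits)]
    for i, (sid, structure) in enumerate(sorted(repository.items())):
        splits[i % n_splits][sid] = structure
    return splits


def run_map_only(query: Structure,
                 repository: Dict[str, Structure],
                 n_splits: int = 2) -> List[Tuple[str, float]]:
    """Run the map function over every split and collect outputs as they are.

    There is no shuffle and no reduce step. On a cluster each split would be
    handled by a separate map task; here the splits run sequentially.
    """
    output: List[Tuple[str, float]] = []
    for split in make_splits(repository, n_splits):
        output.extend(map_split(query, split))
    return output
```

Because each split is processed without reference to any other, the set of results does not depend on how many splits are used. That independence is what lets the work be spread over more map tasks as the cluster grows.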

Cite

Mrozek, D., Suwała, M., & Małysiak-Mrozek, B. (2019). High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model. Knowledge and Information Systems, 60(1), 145–178. https://doi.org/10.1007/s10115-018-1245-3
