learnMSA: learning and aligning large protein families

9Citations
Citations of this article
19Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

Background: The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments. Results: We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum-Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU. Conclusions: Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.

Cite

CITATION STYLE

APA

Becker, F., & Stanke, M. (2022). learnMSA: learning and aligning large protein families. GigaScience, 11. https://doi.org/10.1093/gigascience/giac104

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free