Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights

5 citations · 22 Mendeley readers

Abstract

Recent work demonstrated that a large ensemble of convolutional neural networks (CNNs) outperforms industry-standard approaches at annotating protein sequences that are far from the training data. These results highlight the potential of deep learning to significantly advance protein sequence annotation, but this particular system is not a practical tool for many biologists because of the computational burden of making predictions with a large ensemble. In this work, we fine-tune a transformer model that is pre-trained on millions of unlabeled natural protein sequences, both reducing the system's compute burden at prediction time and improving accuracy. By switching from a CNN to the pre-trained transformer, we lift performance from 73.6% to 90.5% using a single model on a challenging clustering-based train-test split, where the ensemble of 59 CNNs achieved 89.0%. Through extensive stratified analysis of model performance, we provide evidence that the new model's predictions are trustworthy, even in cases known to be challenging for prior methods. Finally, we provide a case study of the biological insight enabled by this approach.
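To make the core idea concrete, below is a minimal sketch of fine-tuning a transformer pre-trained on unlabeled protein sequences for family classification. This is not the authors' system: the checkpoint (HuggingFace's ESM-2 "facebook/esm2_t6_8M_UR50D"), the toy three-family dataset, and the training loop are all illustrative stand-ins for the paper's model and its Pfam-scale label space.

```python
# Illustrative sketch only: fine-tune a transformer pre-trained on unlabeled
# protein sequences to classify sequences into families. The checkpoint and
# data below are assumptions, not the paper's actual model or benchmark.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/esm2_t6_8M_UR50D"  # small pre-trained protein LM (stand-in)
num_families = 3  # toy stand-in for the thousands of Pfam families in the real task

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_families
)

# Toy labeled examples: (amino-acid sequence, family index).
train_data = [
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 0),
    ("MLELLPTAVEGVSQAQITGRPEWIWLALGTALM", 1),
    ("MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS", 2),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    for seq, label in train_data:
        batch = tokenizer(seq, return_tensors="pt")
        labels = torch.tensor([label])
        # Cross-entropy loss over family labels, on top of the pre-trained encoder.
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# A single fine-tuned model then serves predictions at inference time,
# avoiding the cost of running a large CNN ensemble per query sequence.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer(train_data[0][0], return_tensors="pt")).logits
    print("predicted family:", logits.argmax(-1).item())
```

The design point the abstract emphasizes is visible here: pre-training on unlabeled sequences does the heavy lifting, so one fine-tuned model can replace an ensemble of 59 CNNs at prediction time.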

Cite (APA)

Dohan, D., Gane, A., Bileschi, M. L., Belanger, D., & Colwell, L. (2021). Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2782–2791). Association for Computing Machinery. https://doi.org/10.1145/3447548.3467163
