Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights

5 citations · 22 Mendeley readers

Abstract

Recent work demonstrated that a large ensemble of convolutional neural networks (CNNs) outperforms industry-standard approaches at annotating protein sequences that are far from the training data. These results highlight the potential of deep learning to significantly advance protein sequence annotation, but this particular system is not a practical tool for many biologists because of the computational burden of making predictions with a large ensemble. In this work, we fine-tune a transformer model that is pre-trained on millions of unlabeled natural protein sequences, both reducing the system's compute burden at prediction time and improving accuracy. By switching from a CNN to the pre-trained transformer, we lift performance from 73.6% to 90.5% using a single model on a challenging clustering-based train-test split, where the ensemble of 59 CNNs achieved 89.0%. Through extensive stratified analysis of model performance, we provide evidence that the new model's predictions are trustworthy, even in cases known to be challenging for prior methods. Finally, we provide a case study of the biological insight enabled by this approach.
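To make the core idea concrete, below is a minimal sketch of fine-tuning a transformer pre-trained on unlabeled protein sequences for family classification. This is not the authors' system: the checkpoint (HuggingFace's ESM-2 "facebook/esm2_t6_8M_UR50D"), the toy three-family dataset, and the training loop are all illustrative stand-ins for the paper's model and its Pfam-scale label space.

```python
# Illustrative sketch only: fine-tune a transformer pre-trained on unlabeled
# protein sequences to classify sequences into families. The checkpoint and
# data below are assumptions, not the paper's actual model or benchmark.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/esm2_t6_8M_UR50D"  # small pre-trained protein LM (stand-in)
num_families = 3  # toy stand-in for the thousands of Pfam families in the real task

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_families
)

# Toy labeled examples: (amino-acid sequence, family index).
train_data = [
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 0),
    ("MLELLPTAVEGVSQAQITGRPEWIWLALGTALM", 1),
    ("MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS", 2),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    for seq, label in train_data:
        batch = tokenizer(seq, return_tensors="pt")
        labels = torch.tensor([label])
        # Cross-entropy loss over family labels, on top of the pre-trained encoder.
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# A single fine-tuned model then serves predictions at inference time,
# avoiding the cost of running a large CNN ensemble per query sequence.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer(train_data[0][0], return_tensors="pt")).logits
    print("predicted family:", logits.argmax(-1).item())
```

The design point the abstract emphasizes is visible here: pre-training on unlabeled sequences does the heavy lifting, so one fine-tuned model can replace an ensemble of 59 CNNs at prediction time.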

Cite (APA)

Dohan, D., Gane, A., Bileschi, M. L., Belanger, D., & Colwell, L. (2021). Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2782–2791). Association for Computing Machinery. https://doi.org/10.1145/3447548.3467163
