Recent work demonstrated that a large ensemble of convolutional neural networks (CNNs) outperforms industry-standard approaches at annotating protein sequences that are far from the training data. These results highlight the potential of deep learning to significantly advance protein sequence annotation, but this particular system is not a practical tool for many biologists because of the computational burden of making predictions with a large ensemble. In this work, we fine-tune a transformer model pre-trained on millions of unlabeled natural protein sequences to reduce the system's compute burden at prediction time and improve accuracy. By switching from a CNN to the pre-trained transformer, we lift performance from 73.6% to 90.5% using a single model on a challenging clustering-based train-test split, where the ensemble of 59 CNNs achieved 89.0%. Through extensive stratified analysis of model performance, we provide evidence that the new model's predictions are trustworthy, even in cases known to be challenging for prior methods. Finally, we present a case study of the biological insight enabled by this approach.
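To make the approach concrete, below is a minimal sketch of the fine-tuning step, not the authors' exact pipeline: it assumes the Hugging Face `transformers` library with the publicly available ESM-2 checkpoint (`facebook/esm2_t6_8M_UR50D`) as a stand-in for the paper's pre-trained transformer, and treats annotation as sequence-level family classification; `NUM_FAMILIES`, the sequences, and the labels are hypothetical placeholders.

```python
# Minimal sketch (not the authors' exact pipeline): fine-tune a protein
# language model, pre-trained on unlabeled sequences, for family annotation.
# Assumptions: Hugging Face `transformers` + the public ESM-2 checkpoint as a
# stand-in for the paper's transformer; NUM_FAMILIES, the sequences, and the
# labels below are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_FAMILIES = 17929  # hypothetical label count, e.g. one label per protein family

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D", num_labels=NUM_FAMILIES
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy batch: amino-acid sequences paired with integer family labels.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ",
]
labels = torch.tensor([0, 1])

batch = tokenizer(sequences, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # cross-entropy loss over family labels
outputs.loss.backward()
optimizer.step()  # one supervised fine-tuning step
```

At prediction time, a single forward pass through one fine-tuned model replaces the 59-member CNN ensemble, which is the reduction in compute burden the abstract describes.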
Dohan, D., Gane, A., Bileschi, M. L., Belanger, D., & Colwell, L. (2021). Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2782–2791). Association for Computing Machinery. https://doi.org/10.1145/3447548.3467163