Protein language models meet reduced amino acid alphabets



Abstract

Motivation: Protein language models (PLMs), which borrow ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way, leading to significant performance improvements on several downstream tasks. Clustering amino acids by their physicochemical properties to obtain reduced alphabets has been of interest in past research, but the application of such alphabets to PLMs or protein folding models remains unexplored.

Results: Here, we investigate how well PLMs trained on reduced amino acid alphabets capture evolutionary information, and we explore how the resulting loss of sequence information affects learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost under alphabet reduction. We further show that a structure prediction model (ESMFold) can fold CASP14 protein sequences translated into a reduced alphabet. For 10 of the 50 targets, reduced alphabets improve structural predictions, with LDDT-Cα differences of up to 19%.
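Translating a sequence into a reduced alphabet amounts to a many-to-one mapping from each amino acid to a representative symbol for its group. The Python sketch below illustrates the idea; the eight-group clustering and the group names used here are illustrative assumptions based on broad physicochemical classes, not necessarily the reduction schemes evaluated in the article.

# Illustrative sketch of reduced-alphabet translation. The eight-group
# clustering below is a hypothetical example; it is not claimed to be the
# scheme used in the article.

REDUCED_GROUPS = {
    "aliphatic/hydrophobic": "AVLIM",
    "aromatic":              "FWY",
    "positive":              "KRH",
    "negative":              "DE",
    "polar":                 "STNQ",
    "cysteine":              "C",
    "glycine":               "G",
    "proline":               "P",
}

# Represent each amino acid by the first member of its group.
AA_TO_REDUCED = {aa: members[0]
                 for members in REDUCED_GROUPS.values()
                 for aa in members}

def reduce_sequence(seq: str) -> str:
    """Translate a one-letter protein sequence into the reduced alphabet,
    using 'X' for unknown or non-standard residues."""
    return "".join(AA_TO_REDUCED.get(aa, "X") for aa in seq.upper())

print(reduce_sequence("MKTAYIAKQR"))  # -> AKSAFAAKSK

A sequence translated this way can then be given to a PLM or to ESMFold in place of the original full-alphabet sequence, which is the comparison the abstract describes.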

Citation (APA)

Ieremie, I., Ewing, R. M., & Niranjan, M. (2024). Protein language models meet reduced amino acid alphabets. Bioinformatics, 40(2). https://doi.org/10.1093/bioinformatics/btae061
