Protein language models meet reduced amino acid alphabets



Abstract

Motivation: Protein language models (PLMs), which borrow ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way, leading to significant performance improvements on several downstream tasks. Clustering amino acids by their physicochemical properties to obtain reduced alphabets has been of interest in past research, but the application of such alphabets to PLMs or protein folding models remains unexplored.

Results: Here, we investigate how well PLMs trained on reduced amino acid alphabets capture evolutionary information, and we explore how the resulting loss of sequence information affects learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost under alphabet reduction. We further show that a structure prediction model (ESMFold) can fold CASP14 protein sequences translated into a reduced alphabet. For 10 of the 50 targets, reduced alphabets improve structural predictions, with LDDT-Cα differences of up to 19%.
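Translating a sequence into a reduced alphabet amounts to a many-to-one mapping from each amino acid to a representative symbol for its group. The Python sketch below illustrates the idea; the eight-group clustering and the group names used here are illustrative assumptions based on broad physicochemical classes, not necessarily the reduction schemes evaluated in the article.

# Illustrative sketch of reduced-alphabet translation. The eight-group
# clustering below is a hypothetical example; it is not claimed to be the
# scheme used in the article.

REDUCED_GROUPS = {
    "aliphatic/hydrophobic": "AVLIM",
    "aromatic":              "FWY",
    "positive":              "KRH",
    "negative":              "DE",
    "polar":                 "STNQ",
    "cysteine":              "C",
    "glycine":               "G",
    "proline":               "P",
}

# Represent each amino acid by the first member of its group.
AA_TO_REDUCED = {aa: members[0]
                 for members in REDUCED_GROUPS.values()
                 for aa in members}

def reduce_sequence(seq: str) -> str:
    """Translate a one-letter protein sequence into the reduced alphabet,
    using 'X' for unknown or non-standard residues."""
    return "".join(AA_TO_REDUCED.get(aa, "X") for aa in seq.upper())

print(reduce_sequence("MKTAYIAKQR"))  # -> AKSAFAAKSK

A sequence translated this way can then be given to a PLM or to ESMFold in place of the original full-alphabet sequence, which is the comparison the abstract describes.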

Citation (APA)

Ieremie, I., Ewing, R. M., & Niranjan, M. (2024). Protein language models meet reduced amino acid alphabets. Bioinformatics, 40(2). https://doi.org/10.1093/bioinformatics/btae061
