Large language models generate functional protein sequences across diverse families

Abstract

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned on curated sequences and tags to improve controllable generation of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed catalytic efficiencies similar to those of natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
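The conditioning mechanism the abstract describes can be sketched simply: control tags are prepended to the token context so that autoregressive sampling is steered toward a protein family or property. The Python sketch below is illustrative only, not the paper's implementation; the tag name, the toy next-token model, and all function names are assumptions introduced here.

```python
# Minimal sketch of control-tag-conditioned autoregressive sampling,
# in the spirit of ProGen's conditional generation. The tag vocabulary
# and the toy next-token model are illustrative assumptions, not the
# paper's actual architecture or training setup.
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical residues

def toy_next_token(context: list[str]) -> str:
    """Stand-in for a trained language model. A real model would
    condition on the full context, including the control tags; here
    we simply return a uniformly random residue."""
    return random.choice(AMINO_ACIDS)

def generate(control_tags: list[str], max_len: int = 100) -> str:
    """Prepend control tags to the context, then sample residues
    one at a time, feeding each sample back into the context."""
    context = list(control_tags)  # tags steer subsequent sampling
    sequence = []
    for _ in range(max_len):
        token = toy_next_token(context)
        context.append(token)
        sequence.append(token)
    return "".join(sequence)

if __name__ == "__main__":
    # Hypothetical family tag; ProGen's actual control tags derive
    # from protein metadata such as family and functional keywords.
    print(generate(["<lysozyme>"], max_len=60))
```

In a real setting the toy sampler would be replaced by a trained conditional language model, and fine-tuning on a curated family (as the abstract describes for lysozymes) would further concentrate the model's output distribution on that family.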

Citation (APA)

Madani, A., Krause, B., Greene, E. R., Subramanian, S., Mohr, B. P., Holton, J. M., … Naik, N. (2023). Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 41(8), 1099–1106. https://doi.org/10.1038/s41587-022-01618-2
