Synonymous codons, which encode the same amino acid in a protein, are known to be used unequally in organisms. Prior research has uncovered "preferred" codons that occur more often in highly expressed genes. This has enabled computational models that predict the expression of protein-coding genes; however, their performance is often degraded by the more diverse expression patterns of higher organisms, i.e., high expression in only specific tissues or cell types. In this paper, we use a Natural Language Processing (NLP) model, Bidirectional Encoder Representations from Transformers (BERT), to develop a new framework for predicting gene expression. Notably, our model architecture relies on the idea of sentiment analysis, i.e., assigning an overall "emotion" (sentiment) to protein-coding sequences. Our new framework, CodonBERT, is a pre-trained model that better captures the intrinsic relationships between sequences and their expression, and we show that it makes substantially better predictions for a diverse collection of model organisms. Additionally, we show that our model learns inherent patterns of codon usage that can be traced using explainable AI (XAI) algorithms.
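To make the notion of unequal synonymous codon usage concrete, the following minimal Python sketch splits a coding sequence into codon tokens and measures how often each leucine codon is used relative to the other leucine codons. It is an illustration only: the abstract does not describe CodonBERT's actual input representation, so the codon-level tokenization, the helper names `split_codons` and `synonymous_usage`, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter

# The six synonymous codons for leucine in the standard genetic code.
LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def split_codons(cds):
    """Split a coding sequence into consecutive in-frame codons (triplets).

    Hypothetical helper: one plausible way to turn a sequence into
    word-like tokens for a language model.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def synonymous_usage(codons, family):
    """Return each codon's share of all occurrences within one synonymous family."""
    counts = Counter(c for c in codons if c in family)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in family}
    return {c: counts[c] / total for c in family}


# Toy sequence (made up for illustration): ATG CTG CTG TTA CTG AAA TAA
example_cds = "ATGCTGCTGTTACTGAAATAA"
tokens = split_codons(example_cds)
usage = synonymous_usage(tokens, LEUCINE_CODONS)

for codon in LEUCINE_CODONS:
    print(codon, round(usage[codon], 2))
```

In this toy sequence, CTG accounts for three of the four leucine codons, which is the kind of skew that a "preferred" codon would show across highly expressed genes.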
CITATION STYLE
Babjac, A. N., Lu, Z., & Emrich, S. J. (2023). CodonBERT: Using BERT for Sentiment Analysis to Better Predict Genes with Low Expression. In ACM-BCB 2023 - 14th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. Association for Computing Machinery, Inc. https://doi.org/10.1145/3584371.3613013