Abstract
Transformer-based models have achieved remarkable success in biological sequence modeling, yet their application to RNA remains constrained by sequence length limitations. Existing RNA language models often truncate inputs, discarding distal nucleotide context that is crucial for full-length tasks. Additionally, advanced NLP tokenization methods do not directly apply to biological sequences, where nucleotide-level resolution is essential for tasks such as secondary structure prediction. To address these challenges, we introduce BiRNA-BERT, a 117M-parameter Transformer encoder trained on 36 million non-coding RNA sequences. At its core is an adaptive dual-tokenization framework that combines nucleotide-level (NUC) encoding for fine-grained structural tasks with byte-pair encoding (BPE) for efficient long-sequence processing. BiRNA-BERT dynamically selects the tokenization based on input length, enabling it to process arbitrarily long sequences without truncation. We demonstrate state-of-the-art performance across tasks ranging from short-sequence classification to long-context modeling and fine-grained, nucleotide-level RNA structural prediction. Our information-theoretic analysis reveals the trade-offs between BPE compression and NUC tokenization, which we validate empirically. Finally, BiRNA-BERT achieves strong intrinsic language modeling performance, measured by perplexity and token recovery, while remaining more compact than existing RNA models. The code and model weights are available at https://github.com/buetnlpbio/BiRNA-BERT.
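The length-based choice between NUC and BPE tokenization described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the context limit `MAX_TOKENS`, the merge list, and the function names `nuc_tokenize`, `toy_bpe_tokenize`, and `adaptive_tokenize` are all hypothetical, and the toy BPE simply applies a fixed list of merges in order rather than using a trained vocabulary.

```python
from typing import List, Tuple

# Hypothetical context limit, kept tiny for illustration only.
MAX_TOKENS = 8

# Hypothetical merge rules, in the order they would have been learned.
MERGES: List[Tuple[str, str]] = [("A", "C"), ("G", "U"), ("AC", "GU")]


def nuc_tokenize(seq: str) -> List[str]:
    """Nucleotide-level (NUC) tokenization: one token per base."""
    return list(seq)


def toy_bpe_tokenize(seq: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Toy BPE: apply each merge rule in order, scanning left to right."""
    tokens = list(seq)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            is_pair = (
                i + 1 < len(tokens)
                and tokens[i] == left
                and tokens[i + 1] == right
            )
            if is_pair:
                merged.append(left + right)  # fuse the adjacent pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def adaptive_tokenize(
    seq: str,
    merges: List[Tuple[str, str]] = MERGES,
    max_tokens: int = MAX_TOKENS,
) -> Tuple[str, List[str]]:
    """Use NUC when the sequence fits the limit, otherwise fall back to BPE."""
    if len(seq) <= max_tokens:
        return "NUC", nuc_tokenize(seq)
    return "BPE", toy_bpe_tokenize(seq, merges)


short_mode, short_tokens = adaptive_tokenize("ACGU")
long_mode, long_tokens = adaptive_tokenize("ACGUACGUACGU")
```

In this sketch the 4-base input keeps single-nucleotide resolution, while the 12-base input exceeds the assumed limit and is compressed by the merges into fewer, longer tokens. That is the trade-off the abstract refers to: BPE shortens the token sequence at the cost of per-nucleotide granularity.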
Cite
Tahmid, M. T., Shahgir, H. S., Mahbub, S., Dong, Y., & Bayzid, M. S. (2025). BiRNA-BERT allows efficient RNA language modeling with adaptive tokenization. Communications Biology, 8(1). https://doi.org/10.1038/s42003-025-08982-0