Abstract
Motivation: In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notable performance improvements across a variety of downstream tasks, the underlying mechanisms by which pre-training models influence these tasks, particularly in the context of biological data, are not yet fully elucidated. Results: In this study, focusing on the pre-training on nucleotide sequences, we decompose a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into its embedding and encoding modules to analyze what a pre-trained model learns from nucleotide sequences. Through a comparative study of non-standard pre-training at both the data and model levels, we find that a typical BERT model learns to capture overlapping-consistent k-mer embeddings for its token representation within its embedding module. Interestingly, using the k-mer embeddings pre-trained on random data can yield similar performance in downstream tasks, when compared with those using the k-mer embeddings pre-trained on real biological sequences. We further compare the learned k-mer embeddings with other established k-mer representations in downstream tasks of sequence-based functional prediction. Our experimental results demonstrate that the dense representation of k-mers learned from pre-training can be used as a viable alternative to one-hot encoding for representing nucleotide sequences. Furthermore, integrating the pre-trained k-mer embeddings with simpler models can achieve competitive performance in two typical downstream tasks.
Cite
CITATION STYLE
Zhang, Y. Z., Bai, Z., & Imoto, S. (2023). Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings. Bioinformatics, 39(10). https://doi.org/10.1093/bioinformatics/btad617
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.