Abstract
Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations between codon adaptation index as a gene expression factor, sequence similarity, and taxonomy, as well as the functional and structural information of genes.
Cite
Refahi, M., Sokhansanj, B. A., Mell, J. C., Brown, J. R., Yoo, H., Hearne, G., & Rosen, G. L. (2025). Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization. Communications Biology, 8(1). https://doi.org/10.1038/s42003-025-07902-6