The effect of taxonomic, host-dependent features and sample bias on virus host prediction using machine learning and short sequence k-mers

Abstract

Metaviromic studies of potential reservoirs of emerging infections have led to the discovery of many novel viruses. Because a metavirome contains viruses from the target host, its food or other sources, fast and robust approaches are needed to predict the hosts of unknown viruses from their genome data. Four machine learning algorithms (random forest, two gradient boosting machines and a support vector machine) were used here to predict the hosts of RNA viruses that infect mammals, insects and plants. Prediction performance depended largely on dataset composition. In the more challenging task of predicting the hosts of viruses from unknown genera, a median weighted F1-score of 0.79 was achieved using a support vector machine and 4-mer frequencies, a notable improvement over baseline methods (median weighted F1-scores of 0.68 for the homology-based tBLASTx and 0.72 for ML trained on mono-, di- and trinucleotide frequencies). More complicated features and feature combinations gave worse results. When predicting the hosts of short virus sequence fragments, quality decreased, but training on fragments of the same length instead of full genomes consistently improved prediction quality. Therefore, short k-mers carry sufficient information to predict the hosts of novel RNA virus genera. This algorithm can be useful for rapid analysis of metaviromic data to highlight potential biological threats.
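
The two quantities the abstract relies on, 4-mer frequencies as features and the weighted F1-score as the metric, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `kmer_frequencies` and `weighted_f1`, the normalization by total k-mer count, the handling of ambiguous bases and the toy labels are all assumptions made for illustration, and classifier training is not shown.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=4):
    """Return relative k-mer frequencies in lexicographic k-mer order.

    Assumption: RNA 'U' is mapped to 'T', and windows containing
    ambiguous bases (e.g. 'N') are ignored.
    """
    seq = sequence.upper().replace("U", "T")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    n = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Support is the number of true samples of this class.
        score += f1 * (tp + fn) / n
    return score


# Hypothetical toy inputs, not data from the study.
features = kmer_frequencies("ACGUACGU", k=4)
truth = ["mammal", "mammal", "insect", "plant"]
predicted = ["mammal", "insect", "insect", "plant"]
print(len(features), weighted_f1(truth, predicted))
```

For k = 4 the feature vector has 4^4 = 256 entries per sequence, which is the kind of fixed-length input a support vector machine or tree ensemble can consume regardless of genome or fragment length.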

Citation

Perelygin, F. S., Lukashev, A. N., & Aleshina, Y. A. (2025). The effect of taxonomic, host-dependent features and sample bias on virus host prediction using machine learning and short sequence k-mers. Scientific Reports, 15(1). https://doi.org/10.1038/s41598-025-17123-w
