The ability to classify patients based on gene-expression data varies by algorithm and performance metric

Stephen R. Piccolo; Avery Mecham; Nathan P. Golightly; Jérémie L. Johnson; Dustin B. Miller

Journal Article, Open Access

PLoS Computational Biology (2022) 18(3)

DOI: 10.1371/journal.pcbi.1009926

Abstract

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist (and most support diverse hyperparameters), so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
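The nested cross validation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the synthetic data, the nearest-centroid classifier, the mean-difference feature score, the accuracy metric, and every function name below are assumptions chosen only to keep the example dependency-free. What it shows is the general pattern: the number of selected features is tuned on inner folds, and performance is estimated on outer folds that played no part in the tuning.

```python
import random
from statistics import mean, pstdev

def make_data(n=60, p=40, informative=5, seed=0):
    """Synthetic stand-in for an expression matrix (hypothetical data)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced binary class variable
        # Only the first `informative` features differ between classes.
        X.append([rng.gauss(1.5 * label if j < informative else 0.0, 1.0)
                  for j in range(p)])
        y.append(label)
    return X, y

def folds(indices, k, seed):
    """Shuffle the indices and split them into k disjoint folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def top_features(X, y, train, k):
    """Univariate selection: rank features by scaled class-mean difference."""
    scores = []
    for j in range(len(X[0])):
        a = [X[i][j] for i in train if y[i] == 0]
        b = [X[i][j] for i in train if y[i] == 1]
        spread = pstdev(a) + pstdev(b) + 1e-9  # avoid division by zero
        scores.append((abs(mean(a) - mean(b)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def accuracy(X, y, train, test, k):
    """Select k features on `train`, fit nearest centroids, score on `test`."""
    feats = top_features(X, y, train, k)
    centroids = {c: [mean(X[i][j] for i in train if y[i] == c) for j in feats]
                 for c in (0, 1)}
    hits = []
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroids[c]))
                for c in (0, 1)}
        hits.append(1.0 if min(dist, key=dist.get) == y[i] else 0.0)
    return mean(hits)

def cv_accuracy(X, y, groups, k):
    """Mean accuracy over folds, holding out one fold at a time."""
    scores = []
    for held_out in groups:
        train = [i for g in groups if g is not held_out for i in g]
        scores.append(accuracy(X, y, train, held_out, k))
    return mean(scores)

def nested_cv(X, y, k_grid=(2, 5, 20), outer_k=5, inner_k=3):
    """Tune k on inner folds; report accuracy on untouched outer folds."""
    outer = folds(range(len(y)), outer_k, seed=1)
    outer_scores = []
    for f, test in enumerate(outer):
        train = [i for g in outer if g is not test for i in g]
        inner = folds(train, inner_k, seed=2 + f)
        best_k = max(k_grid, key=lambda k: cv_accuracy(X, y, inner, k))
        outer_scores.append(accuracy(X, y, train, test, best_k))
    return outer_scores

outer_scores = nested_cv(*make_data())
print("outer-fold accuracy:", [round(s, 2) for s in outer_scores])
```

Because feature selection and tuning happen only inside each outer training set, the outer test fold never influences which features or settings are chosen, which is the point of nesting.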

Citation (APA)

Piccolo, S. R., Mecham, A., Golightly, N. P., Johnson, J. L., & Miller, D. B. (2022). The ability to classify patients based on gene-expression data varies by algorithm and performance metric. PLoS Computational Biology, 18(3). https://doi.org/10.1371/journal.pcbi.1009926
