Numero: A statistical framework to define multivariable subgroups in complex population-based datasets

This article is free to access.

Abstract

Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.
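The permutation step mentioned in the abstract can be illustrated with a small sketch. This is not the Numero implementation or its API. It is a generic permutation test written in Python, and it assumes that each sample has already been assigned to a map unit (for example by a self-organizing map). The function names `between_unit_variance` and `permutation_p_value`, the choice of test statistic and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean, pvariance


def between_unit_variance(units, values):
    """Variance of the per-unit means; large when a variable differs across map units."""
    by_unit = {}
    for unit, value in zip(units, values):
        by_unit.setdefault(unit, []).append(value)
    return pvariance([mean(vals) for vals in by_unit.values()])


def permutation_p_value(units, values, n_perm=999, seed=0):
    """Empirical P-value for the null hypothesis of no association between
    a variable and the map units (assumed test, not Numero's own)."""
    rng = random.Random(seed)
    observed = between_unit_variance(units, values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the values breaks any link between sample and map unit.
        rng.shuffle(shuffled)
        if between_unit_variance(units, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: 30 samples spread over three hypothetical map units.
toy_rng = random.Random(1)
units = [i % 3 for i in range(30)]
structured = [10.0 * u + toy_rng.gauss(0, 1) for u in units]  # varies by unit
noise = [toy_rng.gauss(0, 1) for _ in units]                  # no unit effect

p_structured = permutation_p_value(units, structured)
p_noise = permutation_p_value(units, noise)
print("structured variable:", p_structured)
print("noise variable:", p_noise)
```

A small P-value for a variable suggests that its pattern across the map is unlikely to arise by chance, which is the kind of statistical evidence an expert could weigh before drawing subgroup boundaries by hand.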

Cite

Gao, S., Mutter, S., Casey, A., & Mäkinen, V. P. (2019). Numero: A statistical framework to define multivariable subgroups in complex population-based datasets. International Journal of Epidemiology, 48(2), 369–374. https://doi.org/10.1093/ije/dyy113
