In this issue, Naimi et al. (Am J Epidemiol. 2023;192(9):1536-1544) discuss a critical topic in public health and beyond: obtaining valid statistical inference when using machine learning in causal research. In doing so, the authors review recent prominent methodological work and recommend: 1) doubly robust estimators, such as targeted maximum likelihood estimation (TMLE); 2) ensemble methods, such as Super Learner, to combine predictions from a diverse library of algorithms; and 3) sample splitting to reduce bias and improve inference. We largely agree with these recommendations. In this commentary, we highlight the critical importance of the Super Learner library. Specifically, in both simulation settings considered by the authors, we demonstrate that reductions in bias and improvements in confidence-interval coverage can be achieved using TMLE without sample splitting and with a Super Learner library that excludes tree-based methods but includes regression splines. Whether extremely data-adaptive algorithms and sample splitting are needed depends on the specific problem and should be informed by simulations reflecting the specific application. More research is needed on practical recommendations for selecting among these options in common situations arising in epidemiology.
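To make the idea of a doubly robust estimator with influence-function-based inference concrete, the following is a minimal sketch under stated assumptions, not the analysis of Naimi et al. or of this commentary. It uses augmented inverse probability weighting (AIPW) rather than TMLE, replaces Super Learner with saturated stratum means for a single binary confounder, and uses no sample splitting. The function names `simulate` and `aipw_ate` and the toy data-generating process (true average treatment effect of 1.0) are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def simulate(n, seed=0):
    """Toy data (an assumption, not the published simulations):
    binary confounder W, binary treatment A, continuous outcome Y.
    The true average treatment effect is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        a = int(rng.random() < (0.7 if w else 0.3))  # treatment depends on W
        y = 1.0 * a + 2.0 * w + rng.gauss(0.0, 1.0)  # outcome depends on A and W
        rows.append((w, a, y))
    return rows


def aipw_ate(rows, level=0.95):
    """AIPW estimate of the average treatment effect with a Wald-type
    confidence interval based on the estimated influence function."""
    n = len(rows)
    count, total = {}, {}
    for w, a, y in rows:
        count[(a, w)] = count.get((a, w), 0) + 1
        total[(a, w)] = total.get((a, w), 0.0) + y
    for cell in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        if count.get(cell, 0) == 0:
            raise ValueError(f"no observations with (A, W) = {cell}")

    # Nuisance estimates from saturated stratum means:
    # g[w] estimates P(A = 1 | W = w); q[(a, w)] estimates E[Y | A = a, W = w].
    g = {w: count[(1, w)] / (count[(0, w)] + count[(1, w)]) for w in (0, 1)}
    q = {cell: total[cell] / count[cell] for cell in count}

    # Per-observation contribution: plug-in contrast plus weighted residuals.
    phi = []
    for w, a, y in rows:
        plug_in = q[(1, w)] - q[(0, w)]
        correction = (a / g[w]) * (y - q[(1, w)]) \
            - ((1 - a) / (1 - g[w])) * (y - q[(0, w)])
        phi.append(plug_in + correction)

    estimate = mean(phi)
    std_err = stdev(phi) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


estimate, lower, upper = aipw_ate(simulate(5000, seed=1))
print(f"ATE estimate: {estimate:.3f}, 95% CI: ({lower:.3f}, {upper:.3f})")
```

In realistic applications the stratum means would be replaced by flexible learners for the outcome regression and propensity score, which is where the choice of library and the question of sample splitting discussed above become relevant.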
CITATION STYLE
Balzer, L. B., & Westling, T. (2023, September 1). Invited Commentary: Demystifying Statistical Inference When Using Machine Learning in Causal Research. American Journal of Epidemiology. Oxford University Press. https://doi.org/10.1093/aje/kwab200