edarf: Exploratory Data Analysis using Random Forests

  • Z. M. Jones
  • F. J. Linder


Although the rise of "big data" has made machine learning algorithms more visible and relevant for social scientists, these algorithms are still widely regarded as "black box" models suited only to prediction, not to substantive research. We argue that this need not be the case, and present one method, Random Forests, with an emphasis on its practical application for exploratory analysis and substantive interpretation. Random Forests detect interaction and nonlinearity without prespecification, have low generalization error in simulations and in many real-world problems, and can be used with many correlated predictors, even when there are more predictors than observations. Importantly, Random Forests can be interpreted in a substantively relevant way with variable importance measures, bivariate and multivariate partial dependence, proximity matrices, and methods for interaction detection. We provide intuition and technical detail about how Random Forests work, in theory and in practice, along with empirical examples from the literature on American and comparative politics. Furthermore, we provide software implementing the methods we discuss, in order to facilitate their use.
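
As an illustration of one interpretation method named in the abstract, partial dependence can be sketched in a few lines. This is a minimal, model-agnostic sketch, not the edarf API: the function `partial_dependence`, the stand-in predictor `toy_predict`, and the feature names `x1` and `x2` are all hypothetical. In practice, `predict` would be the prediction function of a fitted Random Forest.

```python
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Return (value, average prediction) pairs for one feature.

    For each grid value, the chosen feature is overwritten in every
    observed row, all other features are left as observed, and the
    resulting predictions are averaged.
    """
    curve = []
    for value in grid:
        modified = [{**row, feature: value} for row in rows]
        curve.append((value, mean(predict(r) for r in modified)))
    return curve


# Hypothetical stand-in for a fitted model: nonlinear in x1, with an
# x1-by-x2 interaction. A real analysis would use a fitted forest here.
def toy_predict(row):
    return row["x1"] ** 2 + row["x1"] * row["x2"]


rows = [
    {"x1": 0.0, "x2": -1.0},
    {"x1": 1.0, "x2": 0.0},
    {"x1": 2.0, "x2": 1.0},
]

curve = partial_dependence(toy_predict, rows, "x1", [0.0, 1.0, 2.0])
print(curve)
```

Because `x2` averages to zero in this toy data, the interaction term drops out of the average and the curve traces the quadratic effect of `x1` alone. Recovering a nonlinear marginal relationship without specifying it in advance is the sense in which the abstract describes partial dependence as an interpretive tool.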




Jones, Z. M., & Linder, F. J. (2016). edarf: Exploratory Data Analysis using Random Forests. The Journal of Open Source Software, 1(6), 92. https://doi.org/10.21105/joss.00092
