ReForeSt: Random Forests in Apache Spark

Alessandro Lulli; Luca Oneto; Davide Anguita

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 10614 LNCS, pp. 331–339 (2017)

DOI: 10.1007/978-3-319-68612-7_38

Abstract

Random Forests (RF) of tree classifiers are a popular ensemble method for classification. RF are often preferred over other classification techniques because of their limited hyperparameter sensitivity, high numerical robustness, native ability to handle numerical and categorical features, and effectiveness in many real-world classification problems. In this work we present ReForeSt, a Random Forests implementation for Apache Spark that is easier to tune, faster, and less memory-consuming than MLlib, the de facto standard Apache Spark machine learning library. We perform an extensive comparison between ReForeSt and MLlib on the Google Cloud Platform (https://cloud.google.com). In particular, we test ReForeSt and MLlib with different library settings, on different real-world datasets, and on different numbers of machines equipped with different numbers of cores. Results confirm that ReForeSt outperforms MLlib in all of the above-mentioned aspects. ReForeSt is publicly available on GitHub (https://github.com/alessandrolulli/reforest).

Cite (APA)

Lulli, A., Oneto, L., & Anguita, D. (2017). ReForeSt: Random forests in Apache Spark. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10614 LNCS, pp. 331–339). Springer Verlag. https://doi.org/10.1007/978-3-319-68612-7_38
