Including categorical variables with many levels in a logistic regression model easily leads to a sparse design matrix. This can result in a large, ill-conditioned optimization problem, causing overfitting, extreme coefficient values, and long run times. Inspired by recent developments in matrix factorization, we propose four new strategies for overcoming this problem. Each strategy uses a Factorization Machine that transforms the categorical variables with many levels into a few numeric variables, which are subsequently used in the logistic regression model. Applying Factorization Machines also makes it possible to include interactions between the categorical variables with many levels, which often substantially increases model accuracy. The four strategies have been tested on four data sets, demonstrating the superiority of our approach over other methods of handling categorical variables with many levels. In particular, our approach has been successfully used to develop high-quality risk models at the Netherlands Tax and Customs Administration.
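To make the idea of turning many-level categorical variables into a few numeric variables concrete, the following is a minimal sketch and not the authors' implementation. It assumes a standard second-order Factorization Machine trained by stochastic gradient descent on the log-loss, and it uses the resulting FM score as a single numeric feature. This is only one plausible reading of the abstract and is not necessarily one of the four strategies in the paper. The toy data, the hyperparameters, and the names `fm_parts`, `train_fm`, and `fm_features` are all hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fm_parts(idx, w0, w, v):
    """Return (linear part, pairwise part) of the FM score for one row.

    idx holds the active one-hot positions, one per categorical variable.
    """
    linear = w0 + sum(w[i] for i in idx)
    pairwise = 0.0
    for f in range(len(v[0])):
        s = sum(v[i][f] for i in idx)
        sq = sum(v[i][f] ** 2 for i in idx)
        pairwise += 0.5 * (s * s - sq)  # equals sum over i<j of v[i][f] * v[j][f]
    return linear, pairwise

def train_fm(rows, labels, n_levels, k=2, lr=0.05, reg=0.01, epochs=100, seed=0):
    # Assumed training scheme: plain SGD on the log-loss with L2 shrinkage.
    rng = random.Random(seed)
    w0, w = 0.0, [0.0] * n_levels
    v = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_levels)]
    for _ in range(epochs):
        for idx, y in zip(rows, labels):
            linear, pairwise = fm_parts(idx, w0, w, v)
            g = sigmoid(linear + pairwise) - y  # d(log-loss) / d(score)
            sums = [sum(v[i][f] for i in idx) for f in range(k)]
            w0 -= lr * g
            for i in idx:
                w[i] -= lr * (g + reg * w[i])
                for f in range(k):
                    # Gradient of the pairwise term w.r.t. v[i][f] is the sum
                    # of the other active factors in dimension f.
                    v[i][f] -= lr * (g * (sums[f] - v[i][f]) + reg * v[i][f])
    return w0, w, v

# Hypothetical toy data: variable A has levels 0..2, variable B has levels 3..4
# (levels are indexed globally, as in a one-hot design matrix).
rows = [(0, 3), (1, 4), (2, 3), (0, 4), (1, 3), (2, 4)]
labels = [1, 1, 0, 0, 0, 1]
w0, w, v = train_fm(rows, labels, n_levels=5)

# One numeric feature per row, replacing the sparse one-hot columns.
fm_features = [sum(fm_parts(idx, w0, w, v)) for idx in rows]
```

In this sketch, `fm_features` would be appended to the remaining predictors before fitting an ordinary logistic regression, so the regression sees one dense column instead of one indicator column per level.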
Pijnenburg, M., & Kowalczyk, W. (2017). Extending logistic regression models with factorization machines. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10352 LNAI, pp. 323–332). Springer Verlag. https://doi.org/10.1007/978-3-319-60438-1_32