Abstract
A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. Few proposals have been developed to tackle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. We represent the categorical predictor by a graph where the nodes are the categories and establish a probability distribution over meaningful partitions of this graph. Conditional on the observed data, we obtain a posterior distribution for the aggregation of levels, allowing inference about the most probable clustering of the categories. Simultaneously, we obtain inference about all the other regression model parameters. We compare our method with state-of-the-art methods, showing that it has equally good predictive performance and more interpretable results. Our approach balances accuracy against interpretability, a current important concern in statistics and machine learning.
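As a rough illustration of what a probability distribution over partitions of a category graph can look like, the sketch below enumerates the partitions of a toy graph and assigns each a product-form prior. Everything specific here is an assumption made for illustration and is not taken from the paper: the four-level path graph, the reading of "meaningful" as "every block is connected in the graph", the cohesion function `alpha * (|S| - 1)!`, and all function names.

```python
from math import factorial

# Hypothetical toy graph: four category levels on a path A - B - C - D.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"), ("B", "C"), ("C", "D")}

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or into a new singleton block.
        yield [[first]] + partition

def is_connected(block):
    """Check that `block` induces a connected subgraph of the toy graph."""
    block = set(block)
    seen, stack = set(), [next(iter(block))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for v in block - seen:
            if (u, v) in edges or (v, u) in edges:
                stack.append(v)
    return seen == block

def cohesion(block, alpha=1.0):
    """Assumed cohesion alpha * (|S| - 1)!; chosen only for illustration."""
    return alpha * factorial(len(block) - 1)

def partition_prior():
    """Return {partition: probability} over partitions with connected blocks."""
    weights = {}
    for partition in set_partitions(nodes):
        if all(is_connected(b) for b in partition):
            key = tuple(sorted(tuple(sorted(b)) for b in partition))
            w = 1.0
            for b in partition:
                w *= cohesion(b)  # product form: one factor per block
            weights[key] = w
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

prior = partition_prior()
for partition, p in sorted(prior.items(), key=lambda kv: -kv[1]):
    print(partition, round(p, 3))
```

In a full model, each prior weight would be multiplied by the marginal likelihood of the data within each block to give a posterior over aggregations; that step depends on the regression model and is left out of this sketch.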
Criscuolo, T. L., Assunção, R. M., Loschi, R. H., Meira, W., & Cruz-Reyes, D. (2023). HANDLING CATEGORICAL FEATURES WITH MANY LEVELS USING A PRODUCT PARTITION MODEL. Annals of Applied Statistics, 17(1), 786–814. https://doi.org/10.1214/22-AOAS1651