Most of the datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded for their use in supervised learning algorithms. Although there are several encoding techniques, the most commonly used ones do not necessarily preserve possible patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, a comparative study is presented to measure how the different encoding techniques affect the performance of machine learning models. We test 10 encoding methods, using 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach that uses synthetically created datasets that allows us to know a priori the relationship between the independent and the dependent variables, which implies a more precise measurement of the encoding techniques’ impact. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.
CITATION STYLE
Valdez-Valenzuela, E., Kuri-Morales, A., & Gomez-Adorno, H. (2021). Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13067 LNAI, pp. 92–107). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-89817-5_7
Mendeley helps you to discover research relevant for your work.