Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data

Eric Valdez-Valenzuela; Angel Kuri-Morales; Helena Gomez-Adorno

Conference Proceedings


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2021) 13067 LNAI 92-107

DOI: 10.1007/978-3-030-89817-5_7



Abstract

Most datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded before supervised learning algorithms can use them. Although several encoding techniques exist, the most commonly used ones do not necessarily preserve patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, we present a comparative study that measures how different encoding techniques affect the performance of ML models. We test 10 encoding methods with 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach based on synthetically created datasets, which allow us to know a priori the relationship between the independent and the dependent variables and therefore to measure the impact of the encoding techniques more precisely. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.
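The idea of measuring encoders on synthetic data with a known input-output relationship can be illustrated with a small sketch. It is not the authors' protocol: the abstract reports 10 encoders and 5 ML algorithms, whereas this example assumes only two encoders (ordinal and target-mean) and a one-variable least-squares line as the model. All function names, category labels, effect sizes, and sample sizes below are hypothetical choices made for illustration.

```python
import random
from statistics import fmean


def make_synthetic(n, effects, noise_sd, rng):
    """Draw n (category, y) rows where y = known effect of the category + Gaussian noise."""
    categories = sorted(effects)
    rows = []
    for _ in range(n):
        c = rng.choice(categories)
        rows.append((c, effects[c] + rng.gauss(0.0, noise_sd)))
    return rows


def ordinal_encoding(train):
    """Map each category to its alphabetical rank, an order unrelated to y."""
    return {c: float(i) for i, c in enumerate(sorted({c for c, _ in train}))}


def target_encoding(train):
    """Map each category to the mean of y observed for it in the training rows."""
    groups = {}
    for c, y in train:
        groups.setdefault(c, []).append(y)
    return {c: fmean(ys) for c, ys in groups.items()}


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    my = fmean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def compare_encoders(seed=0):
    """Return the held-out R^2 obtained with each encoder on the same synthetic data."""
    rng = random.Random(seed)
    # Hypothetical ground truth: effects are deliberately not monotonic in label order.
    effects = {"a": 5.0, "b": -3.0, "c": 4.0, "d": -6.0}
    train = make_synthetic(2000, effects, 1.0, rng)
    test = make_synthetic(1000, effects, 1.0, rng)
    scores = {}
    for name, encoder in (("ordinal", ordinal_encoding), ("target", target_encoding)):
        mapping = encoder(train)  # the encoder is fitted on training rows only
        slope, intercept = fit_line([mapping[c] for c, _ in train], [y for _, y in train])
        scores[name] = r_squared(
            [mapping[c] for c, _ in test], [y for _, y in test], slope, intercept
        )
    return scores


if __name__ == "__main__":
    print(compare_encoders())
```

Because the category effects are fixed in advance, any gap between the two scores can be attributed to the encoding rather than to unknown structure in the data, which is the kind of control the abstract attributes to synthetic datasets.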


Cite


Valdez-Valenzuela, E., Kuri-Morales, A., & Gomez-Adorno, H. (2021). Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13067 LNAI, pp. 92–107). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-89817-5_7
