Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data

2Citations
Citations of this article
6Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Most of the datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded for their use in supervised learning algorithms. Although there are several encoding techniques, the most commonly used ones do not necessarily preserve possible patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, a comparative study is presented to measure how the different encoding techniques affect the performance of machine learning models. We test 10 encoding methods, using 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach that uses synthetically created datasets that allows us to know a priori the relationship between the independent and the dependent variables, which implies a more precise measurement of the encoding techniques’ impact. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.

Cite

CITATION STYLE

APA

Valdez-Valenzuela, E., Kuri-Morales, A., & Gomez-Adorno, H. (2021). Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13067 LNAI, pp. 92–107). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-89817-5_7

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free