This paper presents EvoSplit, a new evolutionary approach for distributing multi-label data sets into disjoint subsets for supervised machine learning. Currently, data set providers divide a data set either randomly or using iterative stratification, a method that aims to maintain the label (or label pair) distribution of the original data set in the different subsets. With the same aim, this paper first introduces a single-objective evolutionary approach that seeks a split maximizing the similarity for each of those distributions independently. Second, it presents a new multi-objective evolutionary algorithm that maximizes the similarity for both distributions (labels and label pairs) simultaneously. Both approaches are validated using well-known multi-label data sets as well as large image data sets currently used in computer vision and machine learning applications. EvoSplit improves on iterative stratification according to several measures: Label Distribution, Label Pair Distribution, Examples Distribution, and the numbers of folds and fold-label pairs with zero positive examples.
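The single-objective idea can be illustrated with a minimal sketch. It is not the EvoSplit algorithm itself: it assumes one common formulation of the Label Distribution measure (the mean absolute difference between the positive-to-negative ratio of each label in a fold and in the whole data set) and swaps in a simple accept-if-not-worse swap mutation in place of a full evolutionary algorithm. The names `label_distribution` and `evolve_split`, the parameter defaults, and the guard against empty denominators are all hypothetical choices made for this example.

```python
import random


def label_distribution(Y, folds, k):
    """Assumed Label Distribution (LD) measure; lower means more similar.

    Y     : list of binary label vectors, one per example
    folds : fold index (0..k-1) assigned to each example
    """
    n, q = len(Y), len(Y[0])
    total = 0.0
    for i in range(q):
        d_pos = sum(row[i] for row in Y)
        # max(..., 1) avoids division by zero (an assumption of this sketch)
        d_ratio = d_pos / max(n - d_pos, 1)
        acc = 0.0
        for j in range(k):
            idx = [e for e in range(n) if folds[e] == j]
            s_pos = sum(Y[e][i] for e in idx)
            s_ratio = s_pos / max(len(idx) - s_pos, 1)
            acc += abs(s_ratio - d_ratio)
        total += acc / k
    return total / q


def evolve_split(Y, k, generations=2000, seed=0):
    """Simplified search: swap two examples between folds, keep if LD is not worse."""
    rng = random.Random(seed)
    n = len(Y)
    folds = [e % k for e in range(n)]  # equally sized folds
    rng.shuffle(folds)                 # random initial split
    best = label_distribution(Y, folds, k)
    for _ in range(generations):
        a, b = rng.sample(range(n), 2)
        if folds[a] == folds[b]:
            continue
        folds[a], folds[b] = folds[b], folds[a]      # mutation: swap fold labels
        score = label_distribution(Y, folds, k)
        if score <= best:
            best = score                             # keep the mutated split
        else:
            folds[a], folds[b] = folds[b], folds[a]  # revert the swap
    return folds, best
```

Because the mutation only swaps fold assignments, fold sizes stay fixed throughout the search. Extending the fitness to label pairs, or optimizing both measures at once, would require a second objective and a multi-objective selection scheme, which this sketch does not attempt.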
Florez-Revuelta, F. (2021). EvoSplit: An evolutionary approach to split a multi-label data set into disjoint subsets. Applied Sciences (Switzerland), 11(6). https://doi.org/10.3390/app11062823