A comparison of two strategies for scaling up instance selection in huge datasets

Abstract

Instance selection is becoming increasingly relevant because of the huge amount of data that is constantly being produced. However, although current algorithms are useful for fairly large datasets, they run into scaling problems when the number of instances reaches hundreds of thousands or millions. Most instance selection algorithms have complexity of at least O(n²), where n is the number of instances. On huge problems, scalability becomes an issue, and most of the algorithms are not applicable. Recently, two general methods for scaling up instance selection algorithms have been published in the literature: stratification and democratization. Both methods are able to deal successfully with large datasets. In this paper we compare these two methods when applied to very large and huge datasets of up to 50,000,000 instances. Additionally, we test their performance on huge datasets that are also class-imbalanced. The comparison uses a parallel implementation of both methods to fully exploit their possibilities. Although both methods show very good behavior in terms of testing error, storage reduction and execution time, democratization shows better overall performance. © 2011 Springer-Verlag.
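The abstract names stratification without describing it. As a rough illustration only, the sketch below assumes the general idea usually associated with that term: split the data into disjoint random strata, run a quadratic selector on each stratum independently, and take the union of what survives. With s strata of n/s instances each, the pairwise work drops from about n² to about n²/s. The selector `select_1nn_edit` (keep an instance only if its nearest neighbor shares its label) is a hypothetical stand-in, and the names `stratified_selection`, `n_strata` and `seed` are assumptions for this example, not the authors' implementation.

```python
import random


def select_1nn_edit(points, labels):
    """Stand-in quadratic selector (an assumption, not the paper's method):
    keep an instance only if its nearest neighbor, excluding itself,
    has the same label. Returns the kept local indices."""
    kept = []
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean
            if d < best_d:
                best_j, best_d = j, d
        # A stratum with a single instance has no neighbor; keep it.
        if best_j is None or labels[best_j] == labels[i]:
            kept.append(i)
    return kept


def stratified_selection(points, labels, n_strata, seed=0):
    """Split indices into disjoint random strata, run the selector on each
    stratum independently, and return the sorted union of kept indices."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    kept = []
    for s in range(n_strata):
        stratum = idx[s::n_strata]  # every n_strata-th shuffled index
        local = select_1nn_edit([points[i] for i in stratum],
                                [labels[i] for i in stratum])
        kept.extend(stratum[k] for k in local)  # map back to original indices
    return sorted(kept)
```

Because the strata are independent, each call to the selector could run in a separate process, which is one plausible reading of why a parallel implementation suits this kind of method.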

Cite

De Haro-García, A., Pérez-Rodríguez, J., & García-Pedrajas, N. (2011). A comparison of two strategies for scaling up instance selection in huge datasets. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7023 LNAI, pp. 64–73). https://doi.org/10.1007/978-3-642-25274-7_7
