Efficiently determining the starting sample size for progressive sampling

15 citations · 18 Mendeley readers

Abstract

Given a large data set and a classification learning algorithm, Progressive Sampling (PS) learns from increasingly larger random samples until model accuracy no longer improves. The technique has been shown to be remarkably efficient compared to learning from the entire data set. However, how to set the starting sample size for PS remains an open problem. We show that an improper starting sample size can still make PS computationally expensive, because the learning algorithm must run on a large number of instances (across the sequence of random samples drawn before convergence) and excessive database scans are needed to fetch the sample data. A suitable starting sample size can therefore further improve the efficiency of PS. In this paper, we present a statistical approach that efficiently finds such a size. We call it the Statistical Optimal Sample Size (SOSS), in the sense that a sample of this size sufficiently resembles the entire data set. We introduce an information-based measure of this resemblance (Sample Quality) to define the SOSS, and show that it can be obtained efficiently in one scan of the data. We prove that learning on a sample of size SOSS yields model accuracy that asymptotically approaches the best accuracy achievable on the entire data set. Empirical results on a number of large data sets from the UCI KDD repository show that SOSS is a suitable starting size for progressive sampling.
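The progressive sampling loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the geometric sample schedule, the toy nearest-centroid learner, the synthetic data, and the convergence threshold `eps` are all assumptions made here for demonstration.

```python
# Sketch of progressive sampling (PS): train on geometrically growing random
# samples and stop once held-out accuracy no longer improves. The learner is
# a toy nearest-centroid classifier on synthetic Gaussian data; all names and
# parameters here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: 2-D Gaussian clusters centred at -1 and +1.
n_total = 20_000
y_all = rng.integers(0, 2, size=n_total)
X_all = rng.normal(loc=(2.0 * y_all - 1.0)[:, None], scale=1.5,
                   size=(n_total, 2))

# Fixed held-out set for measuring model accuracy.
X_test, y_test = X_all[:2000], y_all[:2000]
X_pool, y_pool = X_all[2000:], y_all[2000:]

def fit_centroids(X, y):
    """Toy learner: one mean vector (centroid) per class."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, X, y):
    """Classify each point by its nearest centroid; return hit rate."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((d.argmin(axis=1) == y).mean())

def progressive_sampling(n_start, eps=0.001, max_n=len(X_pool)):
    """Geometric schedule n, 2n, 4n, ... until the accuracy gain < eps."""
    n, best = n_start, -1.0
    while True:
        idx = rng.choice(len(X_pool), size=min(n, max_n), replace=False)
        acc = accuracy(fit_centroids(X_pool[idx], y_pool[idx]),
                       X_test, y_test)
        if acc - best < eps or n >= max_n:
            return min(n, max_n), acc
        best, n = acc, 2 * n

n_conv, acc = progressive_sampling(n_start=100)
print(n_conv, round(acc, 3))
```

The paper's contribution concerns the choice of `n_start`: a value near the SOSS lets the loop converge after few (ideally one) rounds, avoiding both the repeated learning runs and the repeated database scans that a too-small start incurs.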

Citation (APA)

Gu, B., Liu, B., Hu, F., & Liu, H. (2001). Efficiently determining the starting sample size for progressive sampling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2167, pp. 192–202). Springer Verlag. https://doi.org/10.1007/3-540-44795-4_17
