Efficiently determining the starting sample size for progressive sampling

15 citations · 18 Mendeley readers

Abstract

Given a large data set and a classification learning algorithm, Progressive Sampling (PS) learns from increasingly larger random samples until model accuracy no longer improves. The technique has been shown to be remarkably efficient compared to learning from the entire data set. However, how to set the starting sample size for PS remains an open problem. We show that an improper starting sample size can still make PS computationally expensive, because the learning algorithm must run on a large number of instances (across the sequence of random samples drawn before convergence) and excessive database scans are needed to fetch the sample data. A suitable starting sample size can therefore further improve the efficiency of PS. In this paper, we present a statistical approach that efficiently finds such a size. We call it the Statistical Optimal Sample Size (SOSS), in the sense that a sample of this size sufficiently resembles the entire data set. We introduce an information-based measure of this resemblance (Sample Quality) to define the SOSS, and show that it can be obtained efficiently in one scan of the data. We prove that learning on a sample of size SOSS yields model accuracy that asymptotically approaches the best accuracy achievable on the entire data set. Empirical results on a number of large data sets from the UCI KDD repository show that SOSS is a suitable starting size for progressive sampling.
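The progressive sampling loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the geometric sample schedule, the toy nearest-centroid learner, the synthetic data, and the convergence threshold `eps` are all assumptions made here for demonstration.

```python
# Sketch of progressive sampling (PS): train on geometrically growing random
# samples and stop once held-out accuracy no longer improves. The learner is
# a toy nearest-centroid classifier on synthetic Gaussian data; all names and
# parameters here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: 2-D Gaussian clusters centred at -1 and +1.
n_total = 20_000
y_all = rng.integers(0, 2, size=n_total)
X_all = rng.normal(loc=(2.0 * y_all - 1.0)[:, None], scale=1.5,
                   size=(n_total, 2))

# Fixed held-out set for measuring model accuracy.
X_test, y_test = X_all[:2000], y_all[:2000]
X_pool, y_pool = X_all[2000:], y_all[2000:]

def fit_centroids(X, y):
    """Toy learner: one mean vector (centroid) per class."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, X, y):
    """Classify each point by its nearest centroid; return hit rate."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((d.argmin(axis=1) == y).mean())

def progressive_sampling(n_start, eps=0.001, max_n=len(X_pool)):
    """Geometric schedule n, 2n, 4n, ... until the accuracy gain < eps."""
    n, best = n_start, -1.0
    while True:
        idx = rng.choice(len(X_pool), size=min(n, max_n), replace=False)
        acc = accuracy(fit_centroids(X_pool[idx], y_pool[idx]),
                       X_test, y_test)
        if acc - best < eps or n >= max_n:
            return min(n, max_n), acc
        best, n = acc, 2 * n

n_conv, acc = progressive_sampling(n_start=100)
print(n_conv, round(acc, 3))
```

The paper's contribution concerns the choice of `n_start`: a value near the SOSS lets the loop converge after few (ideally one) rounds, avoiding both the repeated learning runs and the repeated database scans that a too-small start incurs.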

Citation (APA)

Gu, B., Liu, B., Hu, F., & Liu, H. (2001). Efficiently determining the starting sample size for progressive sampling. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2167, pp. 192–202). Springer Verlag. https://doi.org/10.1007/3-540-44795-4_17
