Data reduction is an important issue in the field of data mining. This article describes a new method for selecting a subset of data from a large dataset. A simplified chi-square criterion is proposed for measuring the goodness-of-fit between the distributions of the reduced and full data sets. Under this criterion, the data reduction problem can be formulated as a binary quadratic program, and a tabu search technique is used in the search/optimization process. The procedure is adaptive in that it involves not only random sampling but also deterministic search guided by the results of the previous search steps. The method is applicable primarily to discrete data, but can be extended to continuous data as well. An experimental study comparing the proposed method with simple random sampling on a number of simulated and real-world datasets has been conducted. The results of the study indicate that the distributions of the samples produced by the proposed method are significantly closer to the true distribution than those of random samples.

1. Introduction. In recent years, we have observed an explosion of electronic data generated and collected by individuals, corporations, and government agencies. It was estimated several years ago that the amount of data in the world was doubling every twenty months [5]. By current standards, that estimate is no doubt too conservative. The widespread use of bar codes and scanning devices for commercial products, the computerization of business and government transactions, the rapid development of electronic commerce over the Internet, and the advances in storage technology and database management systems have allowed us to generate and store mountains of data. This rapid growth in data and databases has created the problem of data overload. There has been an urgent need for new techniques and tools that can extract useful information and knowledge from massive volumes of data. Consequently, an emerging field, known as data mining, has flourished in the past several years [4].

Data mining is the process of discovering hidden patterns in databases. The entire process includes (loosely) three steps: (1) data preparation, which includes data collection, data cleaning, data reduction, and data transformation; (2) pattern exploration, which involves developing (or using existing) algorithms and computer programs to discover the patterns of interest; and (3) implementation, in which the patterns discovered in the previous step are used to solve real-world problems such as credit evaluation, fraud detection, and customer relationship management. Although it is commonly acknowledged that data preparation is often the most involved and potentially most important step in the data mining process, there have been
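As a rough illustration of the goodness-of-fit idea behind the criterion outlined in the abstract, the sketch below compares the category distribution of a candidate subset against that of the full discrete dataset using the standard chi-square statistic. This is a minimal sketch only: the paper's simplified criterion, its binary quadratic program formulation, and the tabu search procedure are not reproduced here, and the function name `chi_square_fit` and the toy data are illustrative assumptions rather than part of the original method.

```python
import numpy as np

def chi_square_fit(full_data, subset, categories):
    """Chi-square-style discrepancy between the category distribution of a
    candidate subset and that of the full (discrete) dataset.
    Smaller values indicate a closer fit to the full-data distribution."""
    n = len(subset)
    # proportions of each category in the full dataset
    full_props = np.array([np.mean(full_data == c) for c in categories])
    # expected counts in a subset of size n under the full-data distribution
    expected = n * full_props
    # observed counts in the candidate subset
    observed = np.array([np.sum(subset == c) for c in categories])
    # ignore categories that do not occur in the full data
    mask = expected > 0
    return float(np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask]))

# Toy usage: score a simple random sample against the full data.
rng = np.random.default_rng(0)
full = rng.choice(["a", "b", "c"], size=10_000, p=[0.5, 0.3, 0.2])
sample = rng.choice(full, size=200, replace=False)
print(chi_square_fit(full, sample, categories=["a", "b", "c"]))
```

In this toy example the statistic is merely reported for one random sample; the adaptive procedure described in the article would instead search for a subset that minimizes such a discrepancy measure.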