Tutorial 3. Clustering Techniques for Large Data Sets - From the Past to the Future


Abstract

Because of fast technological progress, the amount of information stored in databases is rapidly increasing. In addition, new applications require the storage and retrieval of complex multimedia objects, which are often represented by high-dimensional feature vectors. Finding the valuable information hidden in those databases is a difficult task. Cluster analysis is one of the basic techniques often applied in analyzing large data sets. Originating from the area of statistics, most cluster analysis algorithms were originally developed for relatively small data sets. In recent years, clustering algorithms have been extended to work efficiently on large data sets, and some of them even allow the clustering of high-dimensional feature vectors. Many such methods use some kind of index structure for efficient retrieval of the required data; other approaches rely on preprocessing to make clustering more efficient. The main goal of the tutorial is to provide an overview of the state of the art in cluster discovery methods for large databases, covering well-known clustering methods from related fields such as statistics, pattern recognition, and machine learning, as well as database techniques which allow them to work efficiently on large databases. The target audience of the tutorial is researchers and practitioners from statistics, databases, and machine learning who are interested in the state of the art of cluster discovery methods and their applications to large databases. The tutorial especially addresses people from academia who are interested in developing new cluster discovery algorithms, and people from industry who want to apply cluster discovery methods in analyzing large databases. The tutorial is structured as follows: First, we give a brief motivation for clustering from modern data mining applications. We discuss important design decisions and explain the interdependencies with the properties of data.
We then introduce a large variety of clustering methods and classify them into four groups: model- and optimization-based methods, linkage-based methods, density-based methods, and hybrid methods. A detailed comparison shows the strengths and weaknesses of the existing techniques and reveals potential for further improvement. We discuss database techniques which have been proposed to improve the effectiveness and efficiency of the cluster discovery process. The four main categories of techniques which can be used for this purpose are hierarchical approaches, incremental approaches, multidimensional indexing, and sampling. The tutorial concludes with a discussion of open problems and future research issues.

Cite

Keim, D. A., & Hinneburg, A. (1999). Tutorial 3. Clustering Techniques for Large Data Sets - From the Past to the Future. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Vol. Part F129196, pp. 141–181). Association for Computing Machinery. https://doi.org/10.1145/312179.312189
