Context-based similarity measures for categorical databases

Gautam Das; Heikki Mannila

Conference Proceedings, Open Access

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2000) 1910 201-210

DOI: 10.1007/3-540-45372-5_20

42 Citations

24 Readers

Abstract

Similarity between complex data objects is one of the central notions in data mining. We propose certain similarity (or distance) measures between various components of a 0/1 relation. We define measures between attributes, between rows, and between subrelations of the database. These measures have important applications in clustering, classification, and several other data mining processes. Our measures are based on the contexts of individual components. For example, two products (i.e., attributes) are deemed similar if their respective sets of customers (i.e., subrelations) are similar. This reveals more subtle relationships between components, something that is usually missing in simpler measures. The problem of finding such distance measures can be formulated as a system of nonlinear equations. We present an iterative algorithm that, when seeded with random initial values, converges quickly to stable distances in practice (typically requiring fewer than five iterations). The algorithm requires only one database scan. Results on artificial and real data show that our method is efficient and produces results with intuitive appeal.
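The abstract does not state the equations, so the following is only an illustrative sketch of the circular idea it describes: attribute distances are defined through the subrelations (sets of rows) in which the attributes occur, and those subrelations are in turn compared using the current attribute distances. It is not the authors' algorithm. The function name `contextual_attribute_distances`, the exponential smoothing step, the L1 comparison, and the max-normalization are all assumptions chosen to keep the example small. The co-occurrence counts are collected in a single pass over the rows, which is one plausible reading of the "one database scan" remark.

```python
import math
import random


def contextual_attribute_distances(rows, n_iter=10, tol=1e-6, seed=0):
    """Hypothetical fixed-point sketch; NOT the method from the paper.

    rows: list of equal-length 0/1 lists (one list per database row).
    Returns an m x m list of attribute distances scaled into [0, 1].
    """
    m = len(rows[0])

    # Single pass over the data: C[a][k] = number of rows with a = 1 and k = 1.
    C = [[0] * m for _ in range(m)]
    for row in rows:
        present = [j for j, v in enumerate(row) if v]
        for a in present:
            for k in present:
                C[a][k] += 1

    # Profile of attribute a: within the subrelation of rows containing a,
    # the fraction of rows that also contain each attribute k.
    P = [[C[a][k] / max(C[a][a], 1) for k in range(m)] for a in range(m)]

    # Random symmetric seed with a zero diagonal.
    rng = random.Random(seed)
    D = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(a + 1, m):
            D[a][b] = D[b][a] = rng.random()

    for _ in range(n_iter):
        # Assumed step: turn distances into row-normalized similarity weights.
        S = [[math.exp(-D[a][b]) for b in range(m)] for a in range(m)]
        S = [[v / sum(r) for v in r] for r in S]
        # Smooth each profile so that mass on similar attributes is shared.
        Q = [[sum(P[a][k] * S[k][j] for k in range(m)) for j in range(m)]
             for a in range(m)]
        # New distance: L1 difference between smoothed profiles.
        D_new = [[sum(abs(x - y) for x, y in zip(Q[a], Q[b]))
                  for b in range(m)] for a in range(m)]
        peak = max(max(r) for r in D_new)
        if peak > 0:
            D_new = [[v / peak for v in r] for r in D_new]
        change = max(abs(D_new[a][b] - D[a][b])
                     for a in range(m) for b in range(m))
        D = D_new
        if change < tol:
            break
    return D


# Toy 0/1 relation: customers as rows, products as columns.
# Products 0 and 1 are bought by exactly the same customers.
toy = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 1, 1]]
toy_distances = contextual_attribute_distances(toy)
```

In this toy relation, products 0 and 1 have identical customer sets, so the sketch assigns them distance 0, while product 2 ends up farther from both. No convergence guarantee is claimed for this stand-in update rule; the loop simply stops after `n_iter` rounds or when the largest change falls below `tol`.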

Cite

Das, G., & Mannila, H. (2000). Context-based similarity measures for categorical databases. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1910, pp. 201–210). Springer Verlag. https://doi.org/10.1007/3-540-45372-5_20
