Abstract
Similarity between complex data objects is one of the central notions in data mining. We propose certain similarity (or distance) measures between various components of a 0/1 relation. We define measures between attributes, between rows, and between subrelations of the database. They find important applications in clustering, classification, and several other data mining processes. Our measures are based on the contexts of individual components. For example, two products (i.e., attributes) are deemed similar if their respective sets of customers (i.e., subrelations) are similar. This reveals more subtle relationships between components, something that is usually missing in simpler measures. Our problem of finding distance measures can be formulated as a system of nonlinear equations. We present an iterative algorithm which, when seeded with random initial values, converges quickly to stable distances in practice (typically requiring less than five iterations). The algorithm requires only one database scan. Results on artificial and real data show that our method is efficient, and produces results with intuitive appeal.
Cite
CITATION STYLE
Das, G., & Mannila, H. (2000). Context-based similarity measures for categorical databases. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1910, pp. 201–210). Springer Verlag. https://doi.org/10.1007/3-540-45372-5_20
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.