A clustering algorithm is, in essence, characterized by two features: (1) the way in which heterogeneity within and between clusters is measured (the objective function), and (2) the steps by which the splitting or merging proceeds. For categorical data there are no "standard indices" formalizing the first aspect. Instead, a number of ad hoc concepts have been used in cluster analysis, labelled "similarity", "information", "impurity" and the like. To clarify matters, we start out from a set of axioms summarizing our conception of "dispersion" for categorical attributes. Unsurprisingly, it turns out that some well-known measures, including the Gini index and the entropy, qualify as measures of dispersion. We indicate how these measures can be used in unsupervised classification problems as well. Due to its simple analytic form, the Gini index allows for a dispersion-decomposition formula that can be made the starting point for a CART-like cluster tree. Trees are favoured because of (i) factor selection and (ii) communicability.
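To make the quantities in the abstract concrete, here is a minimal Python sketch of the Gini index, the entropy, and a within/between decomposition of the Gini dispersion of the kind that underlies a CART-like split. The function names and the specific decomposition (weighted within-cluster dispersion plus a between-cluster remainder, i.e. the usual "Gini gain") are our illustration, not necessarily the formulation used in the paper.

```python
from collections import Counter
import math

def gini(labels):
    """Gini index of a categorical sample: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of a categorical sample: -sum_k p_k log p_k."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def gini_decomposition(parent, children):
    """Split the parent's Gini dispersion into a weighted within-cluster
    part and a between-cluster remainder (the 'gain' a CART-like tree
    would maximize when choosing a split). Assumes the children
    partition the parent sample."""
    n = len(parent)
    within = sum(len(child) / n * gini(child) for child in children)
    between = gini(parent) - within
    return within, between

# Example: a perfectly separating split moves all dispersion "between".
parent = ["a", "a", "b", "b"]
within, between = gini_decomposition(parent, [["a", "a"], ["b", "b"]])
```

For the balanced two-category sample above, `gini(parent)` is 0.5 and the pure split yields zero within-cluster dispersion, so the entire dispersion is accounted for by the between-cluster term.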
Müller-Funk, U. (2008). Measures of dispersion and cluster-trees for categorical data. In Studies in Classification, Data Analysis, and Knowledge Organization (pp. 163–170). Kluwer Academic Publishers. https://doi.org/10.1007/978-3-540-78246-9_20