Similarity based hierarchical clustering with an application to text collections

Julien Ah-Pine; Xinyu Wang

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2016), vol. 9897 LNCS, pp. 320-331

DOI: 10.1007/978-3-319-46349-0_28

Abstract

The Lance-Williams formula is a framework that unifies seven schemes of agglomerative hierarchical clustering. In this paper, we establish a new expression of this formula that uses cosine similarities instead of distances, and we state the conditions under which the new formula is equivalent to the original one. The interest of our approach is twofold. First, it extends agglomerative hierarchical clustering techniques naturally to kernel functions. Second, reasoning in terms of similarities allows us to design thresholding strategies on proximity values. We therefore propose to sparsify the similarity matrix in order to make these clustering techniques more efficient. We apply our approach to text clustering tasks. Our results show that sparsifying the inner product matrix considerably decreases memory usage and shortens running time while preserving clustering quality.
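To make the sparsification idea concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the abstract does not specify the thresholding strategy, so the fixed cutoff of 0.3, the bag-of-words representation, and the helper names `cosine` and `sparse_similarities` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sparse_similarities(docs, threshold):
    """Store only the pairs (i, j), i < j, with cosine similarity >= threshold.

    Hypothetical helper: a fixed global threshold is an assumption,
    not necessarily the strategy used in the paper.
    """
    # Whitespace tokenization into raw term counts (illustrative only).
    vectors = [Counter(doc.lower().split()) for doc in docs]
    kept = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                kept[(i, j)] = s  # pairs below the threshold are never stored
    return kept


docs = [
    "hierarchical clustering of text",
    "agglomerative hierarchical clustering",
    "kernel functions and inner products",
]
sims = sparse_similarities(docs, threshold=0.3)
```

In this toy collection only the first two documents share terms, so a single pair is stored instead of three. The sketch still evaluates every pair once; what it reduces is the number of stored similarities.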

Citation (APA)

Ah-Pine, J., & Wang, X. (2016). Similarity based hierarchical clustering with an application to text collections. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9897 LNCS, pp. 320–331). Springer Verlag. https://doi.org/10.1007/978-3-319-46349-0_28
