Clustering is crucial for many NLP tasks and applications. However, evaluating the results of a clustering algorithm is hard. In this paper we focus on the evaluation setting in which a gold standard solution is available. We discuss two existing information-theoretic measures, V and VI, and show that both are hard to use when comparing performance across different algorithms and different datasets. The V measure favors solutions with a large number of clusters, while the range of scores given by VI depends on the size of the dataset. We present a new measure, NVI, which normalizes VI to address the latter problem. We demonstrate the superiority of NVI in a large experiment involving an important NLP application, grammar induction, using real corpus data in English, German and Chinese. © 2009 Association for Computational Linguistics.
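To make the relationship between VI and NVI concrete, the following is a minimal sketch rather than the authors' implementation. VI is computed in its standard form, H(C|K) + H(K|C), for a gold clustering C and an induced clustering K. The abstract does not spell out the normalizer, so dividing by the entropy of the gold clustering H(C), and falling back to H(K) when H(C) is zero, is an assumption here. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed as H(target, given) - H(given)."""
    return entropy(list(zip(target, given))) - entropy(given)


def vi(gold, induced):
    """Variation of information: H(C|K) + H(K|C)."""
    return conditional_entropy(gold, induced) + conditional_entropy(induced, gold)


def nvi(gold, induced):
    """VI normalized by the gold entropy H(C).

    Assumption: the normalizer is H(C), with H(K) returned when H(C) is zero.
    The abstract only says that NVI normalizes VI.
    """
    h_gold = entropy(gold)
    if h_gold == 0:
        return entropy(induced)
    return vi(gold, induced) / h_gold


gold = [0, 0, 1, 1]
same_partition = [5, 5, 7, 7]   # same grouping as gold, different label names
independent = [0, 1, 0, 1]      # carries no information about gold

print(nvi(gold, same_partition))
print(nvi(gold, independent))
```

Under this assumed normalization, a clustering that reproduces the gold partition scores 0 regardless of how the labels are named, and lower scores are better, as with VI itself.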
Reichart, R., & Rappoport, A. (2009). The NVI clustering evaluation measure. In CoNLL 2009 - Proceedings of the Thirteenth Conference on Computational Natural Language Learning (pp. 165–173). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1596374.1596401