Abstract
This paper investigates active learning of constraints for semi-supervised document clustering. We use intermediate clustering results to guide the selection of document pairs for which user judgments are obtained and turned into constraints. A gain function is designed to choose the most informative document pairs given the current cluster assignments; it measures how much can be learned by revealing the judgment of a document pair. Two models are investigated: the independent gain model and the dependent gain model. The independent gain model assumes that the information learned by revealing the judgment of one document pair is independent of the judgments revealed for other document pairs. The dependent gain model also considers previously chosen documents, avoiding redundant selection and maximizing the gain collectively for a set of document pairs. Constrained semi-supervised clustering and gain-directed document pair selection are conducted iteratively. Extensive experiments on several real-world corpora demonstrate that the intermediate cluster assignments and the interactions among a set of document pairs are useful for improving clustering performance. Our approach also outperforms a recent existing method for this problem.
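The pair-selection step described above can be illustrated with a minimal Python sketch. The abstract does not define the gain function, so everything concrete here is an assumption made for illustration: the gain of a pair is taken to be the binary entropy of the probability that the two documents share a cluster (computed from soft cluster memberships), and the dependent variant is approximated by greedily discounting pairs that reuse an already chosen document. The names `same_cluster_prob`, `pair_gains`, `select_independent`, `select_dependent`, and `overlap_discount` are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def same_cluster_prob(p_i, p_j):
    """Probability that two documents share a cluster, given soft memberships."""
    return sum(a * b for a, b in zip(p_i, p_j))


def binary_entropy(q):
    """Uncertainty (in bits) of a yes/no judgment with P(yes) = q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))


def pair_gains(memberships):
    """ASSUMED gain: entropy of the same-cluster judgment for every pair."""
    return {
        (i, j): binary_entropy(same_cluster_prob(memberships[i], memberships[j]))
        for i, j in combinations(range(len(memberships)), 2)
    }


def select_independent(memberships, m):
    """Independent-style selection: score each pair alone, keep the top m."""
    gains = pair_gains(memberships)
    return sorted(gains, key=gains.get, reverse=True)[:m]


def select_dependent(memberships, m, overlap_discount=0.5):
    """Dependent-style selection (ASSUMED heuristic): greedy choice where a
    pair's gain is discounted once for each document it shares with pairs
    that were already chosen, to reduce redundant queries."""
    gains = pair_gains(memberships)
    chosen, used_docs = [], set()
    while len(chosen) < m and gains:
        def adjusted(pair):
            shared = sum(1 for doc in pair if doc in used_docs)
            return gains[pair] * (overlap_discount ** shared)

        best = max(gains, key=adjusted)
        chosen.append(best)
        used_docs.update(best)
        del gains[best]
    return chosen


# Toy soft cluster memberships for four documents over three clusters.
memberships = [
    [0.5, 0.5, 0.0],  # document 0: torn between clusters 0 and 1
    [1.0, 0.0, 0.0],  # document 1: firmly in cluster 0
    [0.0, 1.0, 0.0],  # document 2: firmly in cluster 1
    [0.0, 0.2, 0.8],  # document 3: mostly in cluster 2
]
independent_pick = select_independent(memberships, 2)
dependent_pick = select_dependent(memberships, 2)
print(independent_pick, dependent_pick)
```

On this toy input, the independent selection spends both queries on pairs that involve the ambiguous document 0, while the discounted greedy selection moves its second query to a pair of documents not yet covered. In an iterative setup such as the one the abstract describes, the judged pairs would become constraints for the next clustering round, which in turn would produce new memberships for the next selection; that loop is not shown here.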
Cite
Huang, R., Lam, W., & Zhang, Z. (2007). Active learning of constraints for semi-supervised text clustering. In Proceedings of the 7th SIAM International Conference on Data Mining (pp. 113–124). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611972771.11