Investigation of linguistic features and clustering algorithms for topical document clustering

Abstract

We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in the official TDT2 competition.
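The abstract does not spell out the model, so the following is only a minimal sketch of the general idea: per-feature similarities for a document pair are combined through a logistic (log-linear) function, and the combined score drives a single-pass clustering. The function names, the toy feature values, the weights, the bias, and the 0.5 threshold are all hypothetical; in a regression setting the weights would be fitted from labeled document pairs rather than set by hand, and the average-similarity assignment rule is one possible choice, not necessarily the authors'.

```python
import math

def combined_similarity(features, weights, bias):
    """Logistic combination of per-feature similarities into one score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def single_pass_cluster(n_docs, pair_sim, threshold):
    """Visit documents in order; add each to the existing cluster with the
    highest average similarity if that reaches the threshold, else start a new one."""
    clusters = []
    for doc in range(n_docs):
        best, best_score = None, threshold
        for cluster in clusters:
            score = sum(pair_sim(doc, other) for other in cluster) / len(cluster)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([doc])
        else:
            best.append(doc)
    return clusters

# Hypothetical pairwise features: (word overlap, noun phrase heads, proper names).
TOY_FEATURES = {
    (0, 1): (0.8, 0.7, 0.9),
    (0, 2): (0.1, 0.2, 0.0),
    (0, 3): (0.2, 0.1, 0.0),
    (1, 2): (0.1, 0.1, 0.1),
    (1, 3): (0.2, 0.0, 0.0),
    (2, 3): (0.7, 0.8, 0.6),
}
WEIGHTS = (3.0, 2.0, 2.0)  # hand-set for illustration, not fitted
BIAS = -3.5                # hand-set for illustration, not fitted

def toy_pair_sim(a, b):
    """Look up the toy features for an unordered pair and combine them."""
    return combined_similarity(TOY_FEATURES[(min(a, b), max(a, b))], WEIGHTS, BIAS)

clusters = single_pass_cluster(4, toy_pair_sim, threshold=0.5)
print(clusters)
```

With these toy values, documents 0 and 1 end up in one cluster and documents 2 and 3 in another. Because a fitted model supplies the weights directly, no grid search over weight combinations is needed, which is the benefit the abstract attributes to the regression approach.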

Hatzivassiloglou, V., Gravano, L., & Maganti, A. (2000). Investigation of linguistic features and clustering algorithms for topical document clustering. In SIGIR Forum (ACM Special Interest Group on Information Retrieval) (pp. 224–231). ACM. https://doi.org/10.1145/345508.345582
