Retro: Concept-based clustering of biomedical topical sets

This article is free to access.

Abstract

Motivation: Clustering methods can automatically group documents into meaningful clusters, improving human comprehension of a document collection. Although some clustering algorithms achieve this goal for relatively large document collections, they do not always work well for small and homogenous datasets.

Methods: In this article, we present Retro, a novel clustering algorithm that extracts meaningful clusters, along with concise and descriptive titles, from small and homogenous document collections. Unlike common clustering approaches, our algorithm predicts cluster titles before clustering. It relies on the hypergeometric distribution model to discover key phrases, and generates candidate clusters by assigning documents to these phrases. The statistical significance of the candidate clusters is then tested using supervised learning methods, and a multiple testing correction technique is used to control the overall quality of clustering.

Results: We test our system on five disease datasets from OMIM® and evaluate the results based on MeSH® term assignments. We further compare our method with several baseline and state-of-the-art methods, including K-means, expectation maximization, latent Dirichlet allocation-based clustering, Lingo, OPTIMSRC and adapted GK-means. The experimental results on the 20-Newsgroup and ODP-239 collections demonstrate that our method is successful at extracting significant clusters and is superior to existing methods in terms of cluster quality. Finally, we apply our system to a collection of 6248 topical sets from the HomoloGene® database, a resource in PubMed®. Empirical evaluation confirms that the method is useful for producing meaningful clusters with descriptive titles from small homogenous datasets.
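
The key-phrase discovery step described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each phrase is scored by the hypergeometric upper-tail probability of seeing it in at least as many documents of the topical set as observed, given its document frequency in a larger background collection that contains the set. The function names, the example phrases and all counts below are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n documents are drawn without replacement from a
    collection of N documents, K of which contain the phrase."""
    total = comb(N, n)
    # comb() returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def rank_phrases(set_counts, background_counts, n_set, n_background):
    """Return (phrase, p_value) pairs sorted from most to least surprising.

    set_counts[phrase]        -> documents in the topical set containing it
    background_counts[phrase] -> documents in the background containing it
    (assumption: the background collection includes the topical set)
    """
    scored = [
        (phrase,
         hypergeom_upper_tail(k, n_background, background_counts[phrase], n_set))
        for phrase, k in set_counts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical counts: a 20-document topical set inside a 1000-document background.
set_counts = {"gene mutation": 15, "patient": 18}
background_counts = {"gene mutation": 60, "patient": 900}
ranking = rank_phrases(set_counts, background_counts, n_set=20, n_background=1000)
```

Under these made-up counts, a phrase that is common everywhere ("patient") gets an unremarkable p-value, while a phrase concentrated in the topical set ("gene mutation") ranks first and would be a natural candidate cluster title. The subsequent significance testing and multiple testing correction mentioned in the abstract are not covered by this sketch.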

Cite

Yeganova, L., Kim, W., Kim, S., & Wilbur, W. J. (2014). Retro: Concept-based clustering of biomedical topical sets. Bioinformatics, 30(22), 3240–3248. https://doi.org/10.1093/bioinformatics/btu514
