Comparative document analysis for large text corpora

Abstract

This paper presents a novel research problem, Comparative Document Analysis (CDA), that is, joint discovery of commonalities and differences between two individual documents (or two sets of documents) in a large text corpus. Given any pair of documents from a (background) document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, where the background corpus is used for computing phrase-document semantic relevance. We use the measures to guide the selection of sets of phrases by solving two joint optimization problems. A scalable iterative algorithm is developed to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance. Experiments on large text corpora from two different domains, scientific papers and news, demonstrate the effectiveness and robustness of the proposed framework on comparing documents. Analysis on a 10GB+ text corpus demonstrates the scalability of our method, whose computation time grows linearly as the corpus size increases. Our case study on comparing news articles published at different dates shows the power of the proposed method on comparing sets of documents.
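
To make the idea of commonality and distinction phrases concrete, here is a minimal toy sketch in Python. It is not the graph-based framework or the joint optimization described in the abstract; it assumes the phrase-document relevance scores are already given, and the function name `select_phrases`, the geometric-mean commonality score, and the relevance-gap distinction score are all illustrative assumptions, not the paper's measures.

```python
from math import sqrt


def select_phrases(rel_a, rel_b, k=2):
    """Pick top-k common and distinct phrases for documents A and B.

    rel_a, rel_b: dicts mapping phrase -> relevance score in [0, 1].
    These scores are assumed inputs; the paper learns relevance from a
    background corpus, which this toy example does not attempt.
    """
    phrases = set(rel_a) | set(rel_b)

    # Assumed commonality score: geometric mean, which is high only
    # when a phrase is relevant to both documents.
    common = {p: sqrt(rel_a.get(p, 0.0) * rel_b.get(p, 0.0)) for p in phrases}

    # Assumed distinction score: relevance gap between the two documents.
    distinct_a = {p: rel_a.get(p, 0.0) - rel_b.get(p, 0.0) for p in phrases}
    distinct_b = {p: -gap for p, gap in distinct_a.items()}

    def top(scores):
        # Sort by descending score, breaking ties alphabetically,
        # and keep only phrases with a positive score.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [p for p, s in ranked if s > 0][:k]

    return top(common), top(distinct_a), top(distinct_b)


# Hypothetical relevance scores for two documents.
rel_a = {"graph mining": 0.9, "phrase extraction": 0.8, "news summarization": 0.1}
rel_b = {"graph mining": 0.8, "phrase extraction": 0.2, "news summarization": 0.9}

common, only_a, only_b = select_phrases(rel_a, rel_b, k=1)
print(common, only_a, only_b)
```

With these made-up scores, the phrase shared by both documents is ranked as the commonality, and each document's distinction is the phrase it scores highly on while the other does not.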

Cite

Ren, X., Lv, Y., Wang, K., & Han, J. (2017). Comparative document analysis for large text corpora. In WSDM 2017 - Proceedings of the 10th ACM International Conference on Web Search and Data Mining (pp. 325–334). Association for Computing Machinery, Inc. https://doi.org/10.1145/3018661.3018690
