Mining Infrequent High-Quality Phrases from Domain-Specific Corpora

Abstract

Phrase mining is a fundamental task in text analysis, with downstream applications such as named entity recognition, topic modeling, and relation extraction. In this paper, we focus on mining high-quality phrases from domain-specific corpora, with special attention to infrequent ones. Previous methods may miss infrequent high-quality phrases at the candidate selection stage, and they rely on explicit features to mine phrases while rarely considering implicit features. In addition, completeness is rarely considered explicitly when evaluating whether a phrase is of high quality. We propose a novel approach that uses a sequence labeling model to capture infrequent phrases, and we employ implicit semantic features and contextual POS tag statistics to measure meaningfulness and completeness, respectively. Experiments on four real-world corpora demonstrate that our method achieves significant improvements over previous state-of-the-art methods across different domains and languages.
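To illustrate why a sequence labeling model can surface infrequent phrases, the sketch below decodes per-token labels into phrase spans. It is only an illustration: the abstract does not state the tagging scheme or model, so the BIO labels, the `decode_bio` helper, and the example sentence are all assumptions. The point is that a span is extracted from the labels of a single sentence, so a phrase that occurs once can still become a candidate, with no frequency threshold involved.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Turn BIO tags into (start, end_exclusive, phrase) spans.

    Hypothetical helper: assumes tags are "B" (phrase start),
    "I" (phrase continuation), or "O" (outside any phrase).
    """
    spans = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new phrase begins; close the previous one first.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I":
            # Tolerate an "I" with no preceding "B" by opening a span here.
            if start is None:
                start = i
        else:
            # "O" closes any open span.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(tags), " ".join(tokens[start:])))
    return spans


# Made-up example sentence and labels, as a tagger might predict them.
tokens = ["chronic", "kidney", "disease", "is", "treated", "with", "dialysis"]
tags = ["B", "I", "I", "O", "O", "O", "B"]
phrases = decode_bio(tokens, tags)
```

In a full pipeline, the decoded spans would then be scored for quality; the abstract names implicit semantic features and contextual POS tag statistics for that step, which this sketch does not attempt to reproduce.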

Cite

Wang, L., Zhu, W., Jiang, S., Zhang, S., Wang, K., Ni, Y., … Xiao, Y. (2020). Mining Infrequent High-Quality Phrases from Domain-Specific Corpora. In International Conference on Information and Knowledge Management, Proceedings (pp. 1535–1544). Association for Computing Machinery. https://doi.org/10.1145/3340531.3412029
