Automated mining of relevant n-grams in relation to predominant topics of text documents


Abstract

The article describes a method for automatically analyzing large collections of short textual documents from the Internet, freely written in various natural languages and represented as sparse vectors, to reveal which multi-word phrases are relevant to a given basic categorization. The revealed phrases also help discover additional predominant topics that the basic categories do not express explicitly. The main idea is to look for n-grams, where an n-gram is a sequence of n consecutive words. This leads to a problem of relevant feature selection, where a feature is an n-gram that provides more information than an individual word. Feature selection is carried out by entropy minimization, which returns a set of combined relevant n-grams that can be used for creating rules, building decision trees, or information retrieval. The results are demonstrated on English, German, Spanish, and Russian customer reviews of hotel services that are publicly available on the web. The most informative output was given by 3-grams.

Cite

Žižka, J., & Dařena, F. (2015). Automated mining of relevant n-grams in relation to predominant topics of text documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9302, pp. 461–469). Springer Verlag. https://doi.org/10.1007/978-3-319-24033-6_52
