Variance-based features for keyword extraction in Persian and English text documents

Abstract

This paper addresses automatic keyword extraction from Persian and English text documents. Keyword extraction generally assigns a weight to each token and selects the words with the highest weights as keywords. This study proposes four word-weighting methods and compares them with five existing weighting techniques: Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), variance, Discriminative Feature Selection (DFS), and document length normalization based on unit words (LNU). The proposed methods are built on variance features: the variance to TF-IDF ratio, the variance to TF ratio, the intersection of TF and variance, and the intersection of variance and IDF. For evaluation, the documents are clustered with K-means, Expectation Maximization (EM), and Ward hierarchical clustering, using the extracted keywords as feature vectors. The entropy of the clusters relative to the predefined document classes is used as the evaluation metric. A set of Persian documents was collected and labeled for these evaluations. The proposed variance to TF ratio showed the best performance on Persian texts, and the variance to TF-IDF ratio achieved the best entropy on English texts.
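
The abstract names the weights and the evaluation metric but does not give their formulas, so the following is only a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that "variance" means the population variance of a term's relative frequency across documents, that TF is the mean of that relative frequency, that IDF is log(N/df), and that cluster entropy is the size-weighted average of the class-label entropy inside each cluster. The function names `term_weights`, `top_keywords`, and `cluster_entropy` are hypothetical, and the two intersection-based methods are not sketched because the abstract does not describe them.

```python
import math
from collections import Counter


def term_weights(docs):
    """Compute illustrative weights per term; docs is a list of token lists.

    All definitions here are assumptions, not the paper's exact formulas.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    weights = {}
    for term in vocab:
        # Relative frequency of the term in each document (0 if absent).
        rel = [c[term] / len(d) if d else 0.0 for c, d in zip(counts, docs)]
        tf = sum(rel) / n_docs                           # mean relative frequency
        var = sum((r - tf) ** 2 for r in rel) / n_docs   # population variance
        df = sum(1 for c in counts if term in c)         # document frequency
        idf = math.log(n_docs / df)
        tfidf = tf * idf
        weights[term] = {
            "tf": tf,
            "tfidf": tfidf,
            "variance": var,
            "var_to_tf": var / tf,
            # A term present in every document has TF-IDF 0; score it 0 here.
            "var_to_tfidf": var / tfidf if tfidf > 0 else 0.0,
        }
    return weights


def top_keywords(weights, key, k):
    """Return the k terms with the highest value of the chosen weight."""
    return sorted(weights, key=lambda t: weights[t][key], reverse=True)[:k]


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted average entropy (in bits) of class labels per cluster."""
    n = len(cluster_ids)
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, []).append(label)
    total = 0.0
    for labels in by_cluster.values():
        size = len(labels)
        h = -sum((k / size) * math.log2(k / size)
                 for k in Counter(labels).values())
        total += (size / n) * h
    return total


docs = [["a", "b", "a", "c"], ["b", "c", "b", "c"], ["a", "a", "a", "b"]]
w = term_weights(docs)
ranking = top_keywords(w, "var_to_tf", 3)
```

Under these assumptions, a term whose usage is concentrated in a few documents gets a high variance relative to its average frequency and therefore ranks above a term spread evenly over the corpus. Lower cluster entropy means the clusters agree more closely with the predefined classes.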

Cite

Veisi, H., Aflaki, N., & Parsafard, P. (2020). Variance-based features for keyword extraction in Persian and English text documents. Scientia Iranica, 27(3 D), 1301–1315. https://doi.org/10.24200/SCI.2019.50426.1685
