Empirical formula for testing word similarity and its application for constructing a word frequency list

Abstract

Many document categorization and clustering tasks require automatically learning a word frequency list from a corpus. However, morphological variation distorts the statistics when a program treats words as mere letter strings, so it is important to identify the strings that are morphological variants of the same base meaning. Because large morphological dictionaries have well-known technical disadvantages, we propose a heuristic, approximate method for this identification based on an empirical formula for testing the similarity of two words. The formula depends on the number of coincident letters in the initial parts of the two words and the number of non-coincident letters in their final parts, and we give a simple method for determining its parameters. An iterative algorithm then constructs the word frequency list from the common parts of all similar words. We give English and Spanish examples. The described technology is implemented in our system, Dictionary Designer.
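The abstract does not state the formula itself, so the following Python sketch only illustrates the general idea under loudly stated assumptions. It assumes that two words count as similar when the share of non-coincident final letters stays below a threshold that grows linearly with the length of the common initial part. The linear form, the parameter names `intercept` and `slope`, their default values, and the function names are all hypothetical and are not taken from the paper or from Dictionary Designer. The merging pass is likewise a simplified single sweep over the sorted vocabulary, not the authors' iterative algorithm.

```python
from collections import Counter


def common_prefix_len(w1, w2):
    """Number of coincident letters in the initial parts of two words."""
    n = 0
    for x, y in zip(w1, w2):
        if x != y:
            break
        n += 1
    return n


def similar(w1, w2, intercept=0.3, slope=0.02):
    """Hypothetical similarity test (assumed form, not the paper's formula).

    y: length of the common initial part
    n: number of non-coincident letters in the final parts of both words
    s: total number of letters in both words
    The words are treated as similar when n / s <= intercept + slope * y.
    """
    if not w1 or not w2:
        return False
    y = common_prefix_len(w1, w2)
    n = (len(w1) - y) + (len(w2) - y)
    s = len(w1) + len(w2)
    return n / s <= intercept + slope * y


def build_frequency_list(tokens, intercept=0.3, slope=0.02):
    """Merge similar words into their common part and sum their counts.

    Simplification: one sweep over the alphabetically sorted vocabulary,
    comparing each word with the common part of the current group.
    """
    counts = Counter(t.lower() for t in tokens)
    merged = []  # list of [common_part, total_count]
    for word in sorted(counts):
        if merged and similar(merged[-1][0], word, intercept, slope):
            stem = merged[-1][0]
            merged[-1][0] = stem[:common_prefix_len(stem, word)]
            merged[-1][1] += counts[word]
        else:
            merged.append([word, counts[word]])
    return [(stem, total) for stem, total in merged]


if __name__ == "__main__":
    sample = ["read", "reads", "reading", "write", "writes", "read"]
    for stem, total in build_frequency_list(sample):
        print(stem, total)
```

With these illustrative parameters, the sample tokens collapse into two entries, one for the common part `read` and one for `write`. A fixed threshold like this will also merge unrelated words that happen to share a long beginning, which is presumably why the abstract stresses a method for determining the formula parameters empirically.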

Cite

Makagonov, P., & Alexandrov, M. (2002). Empirical formula for testing word similarity and its application for constructing a word frequency list. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2276, pp. 425–432). Springer Verlag. https://doi.org/10.1007/3-540-45715-1_45
