Multilingual information retrieval tasks such as Topic Tracking have yielded high-quality results using simple word-by-word translation. However, constructing translation dictionaries for new languages is expensive and time-consuming. We show that an appropriate metric for term selection in a monolingual English corpus allows us to define a fairly small list, containing about 10,000 inflected forms or about 7,500 lemmas, which works essentially as well (for a particular monolingual document classification evaluation) as an unlimited vocabulary of more than 300,000 word forms. We suggest that such a list can form the English axis of a sort of "universal dictionary" for document classification tasks, providing a much more efficient path to the addition of new languages.
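The idea of cutting a vocabulary down to a short, metric-selected term list can be illustrated with a minimal Python sketch. The abstract does not say which selection metric the authors use, so the score below (the spread of a term's document frequency across topics) is only an illustrative stand-in. The function names `select_terms` and `restrict`, the whitespace tokenizer, and the toy documents are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokens; a stand-in for real tokenization.
    return text.lower().split()


def select_terms(labeled_docs, k):
    """Return the k terms whose document frequency differs most across topics.

    The score is a hypothetical stand-in metric, not the one from the paper.
    """
    docs_per_topic = Counter()
    doc_freq = defaultdict(Counter)  # term -> topic -> docs containing the term
    for topic, text in labeled_docs:
        docs_per_topic[topic] += 1
        for term in set(tokenize(text)):
            doc_freq[term][topic] += 1

    def score(term):
        # Fraction of each topic's documents that contain the term.
        rates = [doc_freq[term][t] / n for t, n in docs_per_topic.items()]
        return max(rates) - min(rates)

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda t: (-score(t), t))
    return set(ranked[:k])


def restrict(text, vocabulary):
    # Keep only the tokens that appear in the selected term list.
    return [tok for tok in tokenize(text) if tok in vocabulary]


docs = [
    ("sports", "the team won the match"),
    ("sports", "the match ended in a draw"),
    ("finance", "the bank raised the rate"),
    ("finance", "the rate fell as the bank cut"),
]
vocab = select_terms(docs, k=3)
reduced = restrict("the bank won the match", vocab)
```

In this toy setting, a term such as "the", which occurs in every document, scores zero and is dropped, while topic-specific terms are kept. A classifier would then see only the reduced token lists, and only the selected terms would need translations when a new language is added.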
Schultz, J. M., & Liberman, M. Y. (2002). Towards a “Universal Dictionary” for Multi-Language Information Retrieval Applications (pp. 225–241). https://doi.org/10.1007/978-1-4615-0933-2_11