A Wikipedia-based multilingual retrieval model

Abstract

This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document $d$ written in language $L$, we construct a concept vector $\mathbf{d}$ for $d$, where each dimension $i$ of $\mathbf{d}$ quantifies the similarity of $d$ to a document $d^*_i$ chosen from the "$L$-subset" of Wikipedia. Likewise, for a second document $d'$ written in language $L'$, $L \neq L'$, we construct a concept vector $\mathbf{d}'$, using the topic-aligned counterparts ${d'}^*_i$ of the previously chosen documents, taken from the $L'$-subset of Wikipedia. Since the two concept vectors $\mathbf{d}$ and $\mathbf{d}'$ are collection-relative representations of $d$ and $d'$, they are language-independent; i.e., their similarity can be computed directly, for instance with the cosine similarity measure. We present the results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document $d$, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection. © 2008 Springer-Verlag Berlin Heidelberg.
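
The construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the term weighting or how the index documents are chosen, so plain term frequencies and two tiny hand-written "aligned" index collections (INDEX_EN, INDEX_DE) stand in for the Wikipedia subsets, and all function names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed weighting, for illustration only)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def concept_vector(doc, index_docs):
    """Dimension i holds the similarity of doc to the i-th index document."""
    doc_vec = tf_vector(doc)
    return {i: cosine(doc_vec, tf_vector(idx)) for i, idx in enumerate(index_docs)}


def cl_esa_similarity(d, index_l, d_prime, index_l_prime):
    """Compare two documents in different languages via their concept vectors."""
    # The two index collections must be topic-aligned position by position.
    assert len(index_l) == len(index_l_prime)
    return cosine(concept_vector(d, index_l),
                  concept_vector(d_prime, index_l_prime))


# Toy stand-ins for the aligned L- and L'-subsets (hypothetical data).
INDEX_EN = ["dog cat animal pet", "car engine wheel tire", "music song guitar piano"]
INDEX_DE = ["hund katze tier haustier", "auto motor rad reifen", "musik lied gitarre klavier"]

query = "my dog is a friendly pet"
same_topic = cl_esa_similarity(query, INDEX_EN, "die katze ist ein tier", INDEX_DE)
other_topic = cl_esa_similarity(query, INDEX_EN, "das auto hat einen motor", INDEX_DE)
print(same_topic > other_topic)
```

In this toy setting the English query and the German sentence about animals share no words, yet both are most similar to the first index document in their own language, so their concept vectors point in the same direction. The German sentence about cars activates a different dimension and therefore ranks lower.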

Potthast, M., Stein, B., & Anderka, M. (2008). A Wikipedia-based multilingual retrieval model. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4956 LNCS, pp. 522–530). https://doi.org/10.1007/978-3-540-78646-7_51
