Cross-language high similarity search: Why no sub-linear time bound can be expected

Abstract

This paper contributes to an important variant of cross-language information retrieval called cross-language high similarity search. Given a collection D of documents and a query q in a language different from the language of D, the task is to retrieve the documents in D that are highly similar to q. Use cases include cross-language plagiarism detection and translation search. Current research in cross-language high similarity search compares q with the documents in D in a multilingual concept space, which, however, requires a linear scan of D. Monolingual high similarity search can be tackled in sub-linear time, either by fingerprinting or by "brute force n-gram indexing", as is done by Web search engines. We argue that neither fingerprinting nor brute force n-gram indexing can be applied to cross-language high similarity search, and that a linear scan is inevitable. Our findings are based on theoretical and empirical insights. © 2010 Springer-Verlag Berlin Heidelberg.
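
The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the word 3-gram size, the toy two-dimensional concept vectors, and the 0.9 threshold are all assumptions chosen for illustration. The first half shows a simple n-gram index whose lookup cost depends on the query rather than on the size of D; the second half shows a concept-space comparison that has to touch every document in D.

```python
import math
from collections import defaultdict


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_ngram_index(docs, n=3):
    """Map each word n-gram to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in word_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def monolingual_candidates(index, query, n=3):
    """Count shared n-grams per document via index lookups only.

    The work done depends on the query length and the posting lists hit,
    not on a pass over the whole collection.
    """
    hits = defaultdict(int)
    for gram in word_ngrams(query, n):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return dict(hits)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_language_scan(concept_vectors, query_vector, threshold=0.9):
    """Linear scan: compare the query with every document vector in D.

    The vectors stand in for a hypothetical multilingual concept space.
    """
    return [
        doc_id
        for doc_id, vector in concept_vectors.items()
        if cosine(vector, query_vector) >= threshold
    ]


if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps",
        "d2": "a slow green turtle walks",
    }
    index = build_ngram_index(docs)
    # Same-language query: the shared 3-gram is found by lookup.
    print(monolingual_candidates(index, "quick brown fox"))
    # Query in another language: no surface n-gram overlaps.
    print(monolingual_candidates(index, "der schnelle braune fuchs"))
    # Concept-space comparison needs a pass over all documents.
    vectors = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
    print(cross_language_scan(vectors, [0.9, 0.1]))
```

In this toy setting the translated query shares no word n-grams with the collection, so the index returns nothing, while the concept-space comparison still finds the matching document at the cost of visiting every entry.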

Cite

Anderka, M., Stein, B., & Potthast, M. (2010). Cross-language high similarity search: Why no sub-linear time bound can be expected. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5993 LNCS, pp. 640–644). Springer Verlag. https://doi.org/10.1007/978-3-642-12275-0_66
