Cross-language high similarity search: Why no sub-linear time bound can be expected

Maik Anderka; Benno Stein; Martin Potthast

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2010), Vol. 5993 LNCS, pp. 640-644

DOI: 10.1007/978-3-642-12275-0_66

Abstract

This paper contributes to an important variant of cross-language information retrieval, called cross-language high similarity search. Given a collection D of documents and a query q in a language different from the language of D, the task is to retrieve highly similar documents with respect to q. Use cases for this task include cross-language plagiarism detection and translation search. The current line of research in cross-language high similarity search resorts to the comparison of q and the documents in D in a multilingual concept space, which, however, requires a linear scan of D. Monolingual high similarity search can be tackled in sub-linear time, either by fingerprinting or by "brute force n-gram indexing", as it is done by Web search engines. We argue that neither fingerprinting nor brute force n-gram indexing can be applied to tackle cross-language high similarity search, and that a linear scan is inevitable. Our findings are based on theoretical and empirical insights. © 2010 Springer-Verlag Berlin Heidelberg.

Cite

Anderka, M., Stein, B., & Potthast, M. (2010). Cross-language high similarity search: Why no sub-linear time bound can be expected. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5993 LNCS, pp. 640–644). Springer Verlag. https://doi.org/10.1007/978-3-642-12275-0_66
