Adjusting machine translation datasets for document-level cross-language information retrieval: Methodology

Abstract

Evaluating the performance of Cross-Language Information Retrieval (CLIR) models is a rather difficult task, since collecting and assessing a substantial amount of data for CLIR system evaluation can be a non-trivial and expensive process. At the same time, a substantial number of machine translation datasets are now available. In the present paper we attempt to solve this problem by proposing a strict workflow for transforming a machine translation dataset into a CLIR evaluation dataset (with automatically obtained relevance assessments), as well as a workflow for extracting from the initial large corpus a representative subsample of documents that is suitable for further manual assessment. We also hypothesize, and then confirm through a series of experiments on the United Nations Parallel Corpus data, that the quality of an information retrieval algorithm measured on the automatically assessed sample can in fact be treated as a reasonable metric.
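
The abstract describes deriving a CLIR evaluation set, with automatic relevance assessments, from a parallel machine translation corpus. The sketch below is one plausible reading of that idea, not the authors' exact pipeline: each source-language document serves as a query, and its aligned target-language counterpart is automatically marked as the relevant document. The function name and data layout are illustrative assumptions.

```python
# Minimal sketch (assumed reading of the methodology, not the paper's code):
# turn aligned document pairs into queries, a target-language collection,
# and automatic relevance judgments (qrels).

def build_clir_eval_set(aligned_pairs):
    """aligned_pairs: iterable of (doc_id, source_text, target_text) tuples."""
    queries = {}     # query_id -> source-language text used as the query
    collection = {}  # doc_id   -> target-language document to retrieve from
    qrels = {}       # query_id -> set of relevant doc_ids (automatic assessment)
    for doc_id, source_text, target_text in aligned_pairs:
        query_id = f"q_{doc_id}"
        queries[query_id] = source_text
        collection[doc_id] = target_text
        qrels[query_id] = {doc_id}  # the aligned translation is assumed relevant
    return queries, collection, qrels


# Toy usage with two aligned English/French document pairs.
pairs = [
    ("d1", "Report on sustainable development goals",
           "Rapport sur les objectifs de développement durable"),
    ("d2", "Resolution on peacekeeping operations",
           "Résolution sur les opérations de maintien de la paix"),
]
queries, collection, qrels = build_clir_eval_set(pairs)
print(len(queries), len(collection), qrels["q_d1"])
```

Any CLIR system can then be scored against these automatically obtained qrels, which is the quantity the paper argues can serve as a reasonable evaluation metric.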

Cite

APA

Shtekh, G., Kazakova, P., & Nikitinsky, N. (2018). Adjusting machine translation datasets for document-level cross-language information retrieval: Methodology. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11107 LNAI, pp. 84–94). Springer Verlag. https://doi.org/10.1007/978-3-030-00794-2_9
