Adjusting machine translation datasets for document-level cross-language information retrieval: Methodology

Abstract

Evaluating the performance of Cross-Language Information Retrieval (CLIR) models is a rather difficult task, since collecting and assessing a substantial amount of data for CLIR system evaluation can be a non-trivial and expensive process. At the same time, a substantial number of machine translation datasets are now available. In the present paper we attempt to solve this problem by proposing a strict workflow for transforming a machine translation dataset into a CLIR evaluation dataset (with automatically obtained relevance assessments), as well as a workflow for extracting from the initial large corpus a representative subsample of documents that is suitable for further manual assessment. We also hypothesize, and then confirm through a series of experiments on the United Nations Parallel Corpus data, that the quality of an information retrieval algorithm measured on the automatically assessed sample can in fact be treated as a reasonable metric.
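
The abstract describes deriving a CLIR evaluation set, with automatic relevance assessments, from a parallel machine translation corpus. The sketch below is one plausible reading of that idea, not the authors' exact pipeline: each source-language document serves as a query, and its aligned target-language counterpart is automatically marked as the relevant document. The function name and data layout are illustrative assumptions.

```python
# Minimal sketch (assumed reading of the methodology, not the paper's code):
# turn aligned document pairs into queries, a target-language collection,
# and automatic relevance judgments (qrels).

def build_clir_eval_set(aligned_pairs):
    """aligned_pairs: iterable of (doc_id, source_text, target_text) tuples."""
    queries = {}     # query_id -> source-language text used as the query
    collection = {}  # doc_id   -> target-language document to retrieve from
    qrels = {}       # query_id -> set of relevant doc_ids (automatic assessment)
    for doc_id, source_text, target_text in aligned_pairs:
        query_id = f"q_{doc_id}"
        queries[query_id] = source_text
        collection[doc_id] = target_text
        qrels[query_id] = {doc_id}  # the aligned translation is assumed relevant
    return queries, collection, qrels


# Toy usage with two aligned English/French document pairs.
pairs = [
    ("d1", "Report on sustainable development goals",
           "Rapport sur les objectifs de développement durable"),
    ("d2", "Resolution on peacekeeping operations",
           "Résolution sur les opérations de maintien de la paix"),
]
queries, collection, qrels = build_clir_eval_set(pairs)
print(len(queries), len(collection), qrels["q_d1"])
```

Any CLIR system can then be scored against these automatically obtained qrels, which is the quantity the paper argues can serve as a reasonable evaluation metric.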

Cite

APA

Shtekh, G., Kazakova, P., & Nikitinsky, N. (2018). Adjusting machine translation datasets for document-level cross-language information retrieval: Methodology. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11107 LNAI, pp. 84–94). Springer Verlag. https://doi.org/10.1007/978-3-030-00794-2_9
