We study the impact of translation resource scarcity on the performance of cross-language information retrieval (CLIR) systems. To do so, we develop a contrastive analysis framework that uses high-resource languages to simulate low-resource languages. Within the framework, we focus on parallel translation corpora and aim to better understand the factors that affect CLIR performance. We argue that both low- and high-resource corpora are needed to develop that understanding. Hence, we start with a true low-resource language and systematically down-sample a high-resource language until it becomes an artificial low-resource language, which is the reverse perspective of existing research. We formalize this as the Resource Scarcity Simulation (RSS) problem. We model the problem as a family of set covering problems, formulate it with integer linear programming, and prove that it is NP-hard. We therefore provide two greedy algorithms with polynomial complexity. We compare our approach with alternative techniques using four high-resource languages (French, Italian, German, and Finnish) down-sampled to simulate two low-resource languages (Somali and Swahili). Our experimental results suggest that language families are important for the RSS problem. We simulate Somali with German and Swahili with Finnish, achieving 98% and 97% similarity in CLIR performance, respectively.
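The abstract does not describe the two greedy algorithms, so the following is only a sketch of the textbook greedy heuristic for set cover, the problem family the RSS formulation is modeled on. It is not the authors' method. The function name `greedy_set_cover` and the toy data, in which each hypothetical sentence pair covers a few vocabulary items, are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(
    universe: Set[Hashable], subsets: Dict[str, Set[Hashable]]
) -> List[str]:
    """Textbook greedy set cover (illustrative, not the paper's algorithm).

    Repeatedly picks the subset that covers the most still-uncovered
    elements until the whole universe is covered.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        if not subsets:
            raise ValueError("universe cannot be covered by the given subsets")
        # Iterate over sorted names so that ties are broken deterministically.
        best = max(sorted(subsets), key=lambda name: len(subsets[name] & uncovered))
        gain = subsets[best] & uncovered
        if not gain:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical toy instance: each "sentence pair" covers some vocabulary items.
universe = {"a", "b", "c", "d", "e"}
subsets = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"d", "e"},
    "s4": {"a"},
}
print(greedy_set_cover(universe, subsets))
```

Each iteration scans every subset once, so the running time is polynomial in the input size, although the cover it returns is not guaranteed to be minimal.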
CITATION STYLE
Bonab, H., Allan, J., & Sitaraman, R. (2019). Simulating CLIR translation resource scarcity using high-resource languages. In ICTIR 2019 - Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (pp. 129–136). Association for Computing Machinery, Inc. https://doi.org/10.1145/3341981.3344236