Augmenting SMT with semantically-generated virtual-parallel corpora from monolingual texts

Abstract

Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and provides rather noisy output, which later needs to be further processed and requires in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training of a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We have created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora successfully improved the quality of SMT systems and seem to be useful for many other natural language processing tasks.
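The compatibility check described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact Levenshtein-based rules, so everything below is an assumption made for illustration: the character-level distance, the normalization by the longer string, the requirement that all pairs of translations agree, the 0.8 threshold, and the hypothetical function names `similarity` and `translations_agree`.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def translations_agree(translations, threshold=0.8):
    """Assumed rule: every pair of system outputs must be similar enough."""
    return all(
        similarity(x, y) >= threshold
        for x, y in combinations(translations, 2)
    )
```

Under these assumptions, a monolingual source sentence would be paired with one of its machine translations and added to the virtual-parallel corpus only when `translations_agree` returns `True` for the outputs of the three systems. A stricter threshold would likely trade corpus size for accuracy.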

Wołk, K., & Wołk, A. (2018). Augmenting SMT with semantically-generated virtual-parallel corpora from monolingual texts. In Advances in Intelligent Systems and Computing (Vol. 745, pp. 358–374). Springer Verlag. https://doi.org/10.1007/978-3-319-77703-0_37
