Unsupervised tokenization for machine translation

Abstract

Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages, such as Chinese, do not incorporate spacing in their writing system, which creates a challenge for tokenization. Moreover, morphologically rich languages such as Korean present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often unclear. Both rule-based and statistical solutions are currently used. In this paper, we present unsupervised methods to solve the tokenization problem. Our methods incorporate information available from a parallel corpus to determine a good tokenization for machine translation. © 2009 ACL and AFNLP.

Cite

Chung, T., & Gildea, D. (2009). Unsupervised tokenization for machine translation. In EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 (pp. 718–726). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1699571.1699606
