Unsupervised tokenization for machine translation

Tagyoung Chung; Daniel Gildea

Conference Proceedings

EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 (2009) 718-726

DOI: 10.3115/1699571.1699606

40 Citations

142 Readers

Abstract

Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages, such as Chinese, do not incorporate spacing in their writing system, which creates a challenge for tokenization. Moreover, morphologically rich languages such as Korean present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often unclear. Both rule-based solutions and statistical solutions are currently used. In this paper, we present unsupervised methods to solve the tokenization problem. Our methods incorporate information available from a parallel corpus to determine a good tokenization for machine translation. © 2009 ACL and AFNLP.

Cite

APA

Chung, T., & Gildea, D. (2009). Unsupervised tokenization for machine translation. In EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 (pp. 718–726). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1699571.1699606
