Word segmentation for dialect translation

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source-language text. The aim is to improve the quality of statistical machine translation (SMT) for local dialects by exploiting linguistic information from the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step, the multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of the differently segmented SMT models. Experiments translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) showed that the proposed system outperforms SMT engines trained on character-based as well as standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics. © 2011 Springer-Verlag.
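
The integration step described in the abstract can be illustrated with a minimal sketch. It assumes that "characterizing" means rewriting each source phrase as a sequence of single characters, so that phrase pairs from differently segmented models become directly comparable. The function names, the dictionary-based phrase table, the toy entries and their scores, and the choice of keeping the maximum score for merged pairs are all hypothetical and are not taken from the paper.

```python
def characterize(source_phrase: str) -> str:
    """Rewrite a segmented source phrase as space-separated characters.

    Assumption: this is what "characterizing the source side" amounts to.
    """
    return " ".join(ch for ch in source_phrase if not ch.isspace())


def merge_phrase_tables(tables):
    """Merge phrase tables from differently segmented models.

    Each table maps (source_phrase, target_phrase) -> score.
    Pairs that become identical after characterization are merged.
    Keeping the maximum score is an illustrative choice only.
    """
    merged = {}
    for table in tables:
        for (src, tgt), score in table.items():
            key = (characterize(src), tgt)
            merged[key] = max(score, merged.get(key, 0.0))
    return merged


# Toy phrase tables with invented scores, one per segmentation scheme.
standard_seg = {("おお きに", "thank you"): 0.6}
learned_seg = {("おおきに", "thank you"): 0.8, ("おおきに", "thanks"): 0.2}

merged = merge_phrase_tables([standard_seg, learned_seg])
for (src, tgt), score in merged.items():
    print(src, "|||", tgt, "|||", score)
```

In this sketch the two differently segmented entries for the same dialect expression collapse into one character-level entry, so a single decoder can use translation pairs learned under either segmentation.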

Cite

Paul, M., Finch, A., & Sumita, E. (2011). Word segmentation for dialect translation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6609 LNCS, pp. 55–67). https://doi.org/10.1007/978-3-642-19437-5_5
