Word segmentation for dialect translation

Michael Paul; Andrew Finch; Eiichiro Sumita

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2011), Vol. 6609 LNCS (Part 2), pp. 55-67

DOI: 10.1007/978-3-642-19437-5_5

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text. The goal is to improve the translation quality of statistical machine translation (SMT) for local dialects by exploiting linguistic information of the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step, the multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experiments translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) showed that the proposed system outperforms SMT engines trained on character-based as well as standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics. © 2011 Springer-Verlag.
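
The integration step described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: "characterizing the source language side" is read here as re-splitting every source phrase into single characters, a phrase table is modeled as a plain dictionary from (source, target) pairs to a single score, and colliding entries keep the highest score. The function names `characterize` and `merge_phrase_tables` and the toy entries are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical phrase-table type: (source phrase, target phrase) -> score.
PhraseTable = Dict[Tuple[str, str], float]


def characterize(source_phrase: str) -> str:
    """Re-split a segmented source phrase into single characters.

    Assumption: this is what "characterizing the source side" means here.
    """
    chars = [ch for token in source_phrase.split() for ch in token]
    return " ".join(chars)


def merge_phrase_tables(tables: Iterable[PhraseTable]) -> PhraseTable:
    """Merge tables built from different segmentation schemes.

    Entries that become identical after characterization are merged.
    Assumption: the merged entry keeps the maximum score; the abstract
    does not say how scores are combined.
    """
    merged: PhraseTable = {}
    for table in tables:
        for (source, target), score in table.items():
            key = (characterize(source), target)
            merged[key] = max(score, merged.get(key, float("-inf")))
    return merged


# Invented toy entries: the same dialect expression under two segmentations.
table_a = {("おおきに", "thank you"): 0.6}
table_b = {("おお きに", "thank you"): 0.4, ("おお きに", "thanks"): 0.3}

merged = merge_phrase_tables([table_a, table_b])
for (source, target), score in sorted(merged.items()):
    print(source, "->", target, score)
```

In this toy case the two differently segmented "thank you" entries collapse into one character-level entry, while the "thanks" entry stays separate because its target side differs.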

Cite

Paul, M., Finch, A., & Sumita, E. (2011). Word segmentation for dialect translation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6609 LNCS, pp. 55–67). https://doi.org/10.1007/978-3-642-19437-5_5
