An efficient framework to extract parallel units from comparable data

Abstract

Because the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we draw on the idea of phrase-table acquisition in SMT to extract parallel fragments. A novel word alignment model is designed specifically for comparable sentence pairs, and parallel fragments are extracted based on this word alignment. We integrate the extraction tasks at the two levels into a unified framework. Experimental results show that the baseline SMT system achieves significant improvement when the additionally mined data is added. © Springer-Verlag Berlin Heidelberg 2013.
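
The sub-sentential step mentioned in the abstract builds on phrase-table acquisition from word alignments. As a rough illustration only, the sketch below applies the standard alignment-consistency criterion used in phrase-based SMT to a partially aligned sentence pair. It is not the paper's alignment model or extraction procedure: the function name `extract_fragments`, the `max_len` parameter, the rule that span boundaries must be aligned words, and the toy sentence pair are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def extract_fragments(
    src: List[str],
    tgt: List[str],
    alignment: Set[Tuple[int, int]],
    max_len: int = 4,
) -> List[Tuple[str, str]]:
    """Return fragment pairs that are consistent with the word alignment.

    `alignment` holds (source_index, target_index) links. A pair of spans
    is kept when no link crosses the span boundary in either direction.
    Hypothetical helper for illustration, not the paper's method.
    """
    aligned_src = {i for i, _ in alignment}
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Assumption: a fragment must start and end on aligned words,
            # so unaligned material in a comparable sentence is not pulled in.
            if i1 not in aligned_src or i2 not in aligned_src:
                continue
            # Target positions linked to the source span [i1, i2].
            linked = [j for i, j in alignment if i1 <= i <= i2]
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no target word inside [j1, j2] may be
            # linked to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for i, j in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy comparable pair: only the first three source words have a translation.
src = ["the", "red", "house", "is", "old"]
tgt = ["la", "maison", "rouge", "date", "de", "1900"]
alignment = {(0, 0), (1, 2), (2, 1)}

pairs = extract_fragments(src, tgt, alignment)
for s, t in pairs:
    print(f"{s} ||| {t}")
```

In this toy case the span "the red" is rejected because "maison", which falls inside the corresponding target span, is linked to "house" outside the source span. The non-parallel tails of both sentences never appear in any fragment because they carry no alignment links.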

Cite

Xiang, L., Zhou, Y., & Zong, C. (2013). An efficient framework to extract parallel units from comparable data. In Communications in Computer and Information Science (Vol. 400, pp. 151–163). Springer Verlag. https://doi.org/10.1007/978-3-642-41644-6_15