An efficient framework to extract parallel units from comparable data

Lu Xiang; Yu Zhou; Chengqing Zong

Conference Proceedings

Communications in Computer and Information Science (2013) 400 151-163

DOI: 10.1007/978-3-642-41644-6_15

6 Citations

3 Readers

Abstract

Because the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we borrow the idea of phrase-table acquisition in SMT to extract parallel fragments: a novel word alignment model is designed specifically for comparable sentence pairs, and parallel fragments are extracted based on the resulting word alignment. We integrate the extraction tasks at both levels into a unified framework. Experimental results show that a baseline SMT system achieves significant improvement when the additionally mined data is added. © Springer-Verlag Berlin Heidelberg 2013.
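The abstract says fragment extraction follows the idea of phrase-table acquisition in SMT. As a rough illustration of that general idea only, the sketch below implements the standard consistency check used when building a phrase table from a word alignment. It is not the authors' alignment model or extraction procedure. The function name `extract_phrase_pairs`, the `max_len` limit, and the toy sentence pair are all assumptions made for this example, and the usual extension over unaligned target words is omitted for brevity.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    alignment: Set[Tuple[int, int]],
    max_len: int = 4,
) -> List[Tuple[str, str]]:
    """Return phrase pairs consistent with a word alignment.

    `alignment` holds (source_index, target_index) links. A pair of spans
    is consistent when no link connects a word inside one span to a word
    outside the other span.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            tgt_positions = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_positions:
                continue  # fully unaligned source spans are skipped
            j1, j2 = min(tgt_positions), max(tgt_positions)
            if j2 - j1 + 1 > max_len:
                continue
            # Every link landing in the target span must start in the source span.
            consistent = all(
                i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2
            )
            if consistent:
                pairs.append(
                    (" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1]))
                )
    return pairs


# Toy example (hypothetical data, not from the paper).
example_src = ["das", "Haus", "ist", "klein"]
example_tgt = ["the", "house", "is", "small"]
example_alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}
example_pairs = extract_phrase_pairs(
    example_src, example_tgt, example_alignment, max_len=2
)
```

On comparable rather than parallel sentence pairs, many words have no counterpart on the other side, which is presumably why the paper designs a dedicated alignment model before applying an extraction step of this kind.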

Cite

Xiang, L., Zhou, Y., & Zong, C. (2013). An efficient framework to extract parallel units from comparable data. In Communications in Computer and Information Science (Vol. 400, pp. 151–163). Springer Verlag. https://doi.org/10.1007/978-3-642-41644-6_15
