Since the quality of statistical machine translation (SMT) depends heavily on the size and quality of the training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we draw on the idea of phrase table acquisition in SMT to extract parallel fragments. A novel word alignment model is designed specifically for comparable sentence pairs, and parallel fragments are extracted based on this word alignment. We integrate the extraction tasks at the two levels into a unified framework. Experimental results show that the baseline SMT system achieves significant improvement when the additionally mined data is added. © Springer-Verlag Berlin Heidelberg 2013.
Xiang, L., Zhou, Y., & Zong, C. (2013). An efficient framework to extract parallel units from comparable data. In Communications in Computer and Information Science (Vol. 400, pp. 151–163). Springer Verlag. https://doi.org/10.1007/978-3-642-41644-6_15