This paper addresses data selection for language model (LM) adaptation in statistical machine translation (SMT), aiming to find LM training sentences that are topically similar to the translation task. Although traditional methods have achieved significant performance gains, they ignore topic information and the distribution of words when calculating sentence similarity. The authors propose a topic model to discover the latent topics in sentence content, and combine latent-topic-based similarity with TF-IDF in a unified framework for data selection. Furthermore, they combine a cross-lingual projection method with the topic model, which makes data selection depend directly on the source input. Large-scale experimental results demonstrate that the proposed approach significantly outperforms traditional approaches on both LM perplexity and SMT performance. © 2012 Springer-Verlag.
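The general idea of scoring candidate LM training sentences by both lexical and topical similarity can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two similarities are combined, so the linear interpolation, the `weight` parameter, the smoothed IDF formula, and every function name below are assumptions made for illustration. The topic distributions are taken as given inputs (in practice they would come from a trained topic model), and the cross-lingual projection step is not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        if not doc:
            vectors.append({})
            continue
        tf = Counter(doc)
        # Smoothed IDF (an assumption; many variants exist).
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_sentences(query, candidates, query_topics, candidate_topics,
                     weight=0.5, top_k=2):
    """Rank candidates by an interpolated topic/TF-IDF similarity to the query.

    query, candidates: tokenized sentences (lists of strings).
    query_topics, candidate_topics: topic distributions, assumed to be
    produced elsewhere by a topic model (hypothetical inputs).
    weight: share given to topic similarity (hypothetical parameter).
    Returns the indices of the top_k highest-scoring candidates.
    """
    vectors = tfidf_vectors([query] + candidates)
    query_vec = vectors[0]
    query_top = dict(enumerate(query_topics))
    scored = []
    for i, topics in enumerate(candidate_topics):
        lexical = cosine(query_vec, vectors[i + 1])
        topical = cosine(query_top, dict(enumerate(topics)))
        scored.append((weight * topical + (1 - weight) * lexical, i))
    # Highest score first; ties broken by original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [i for _, i in scored[:top_k]]
```

Setting `weight=0.0` reduces the score to plain TF-IDF similarity, and `weight=1.0` uses topic similarity alone, which makes it easy to compare the two signals on the same candidate pool.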
Lu, S., Wei, W., Fu, X., Fan, L., & Xu, B. (2012). Learning latent topic information for language model adaptation. In Communications in Computer and Information Science (Vol. 333 CCIS, pp. 143–153). https://doi.org/10.1007/978-3-642-34456-5_14