Statistical machine translation (SMT) models are known to benefit from the availability of a domain-specific bilingual lexicon. Such lexicons have traditionally consisted of multiword expressions, either extracted from parallel corpora or manually curated. We claim that "patterns", composed of words and higher-order categories, generalize better in capturing the syntax and semantics of the domain. In this work, we present an approach that extracts such patterns from a domain corpus and curates a high-quality bilingual lexicon. We discuss several features of these patterns that define the "consensus" among their underlying multiwords. We incorporate the bilingual lexicon into a baseline SMT model, and detailed experiments show that the resulting translation model performs considerably better than the baseline and other comparable systems.
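To make the idea of generalization concrete, here is a minimal sketch (not the authors' implementation) of how a pattern mixing literal words with higher-order categories, illustrated here with POS tags, can cover several multiword expressions at once. The `POS` lexicon and the `matches` function are hypothetical illustrations.

```python
# Hypothetical POS lexicon; a real system would use a tagger.
POS = {"book": "NOUN", "ticket": "NOUN", "reserve": "VERB",
       "a": "DET", "flight": "NOUN"}

def matches(pattern, multiword):
    """A multiword matches a pattern if, position by position, each
    pattern element is either the literal word itself or that word's
    higher-order category (here, its POS tag)."""
    tokens = multiword.split()
    if len(pattern) != len(tokens):
        return False
    return all(p == t or p == POS.get(t) for p, t in zip(pattern, tokens))

# One pattern stands in for a family of multiwords:
pattern = ["reserve", "a", "NOUN"]
print(matches(pattern, "reserve a flight"))   # True
print(matches(pattern, "reserve a ticket"))   # True
print(matches(pattern, "reserve a reserve"))  # False ("reserve" is a VERB here)
```

A single such pattern can thus replace many literal multiword entries, which is what allows a pattern-based lexicon to stay compact while covering the domain.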
Citation:
Singh, P., Kulkarni, A., Ojha, H., Kumar, V., & Ramakrishnan, G. (2016). Building compact lexicons for cross-domain SMT by mining near-optimal pattern sets. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9651, pp. 290–303). Springer Verlag. https://doi.org/10.1007/978-3-319-31753-3_24