Abstract
Recent work has questioned the necessity of visual information in Multimodal Machine Translation (MMT). This paper addresses that question and builds a new benchmark. Because the existing dataset is simple and its text input is self-sufficient, we introduce a challenging dataset called EMMT, whose test set is deliberately designed to ensure ambiguity. More importantly, we study this problem in a real-world scenario, aiming to make the most of multimodal training data. We propose a new framework, 2/3-Triplet, which naturally makes full use of large-scale image-text and parallel text-only data. Extensive experiments show that visual information is highly crucial on EMMT. The proposed 2/3-Triplet outperforms a strong text-only competitor by 3.8 BLEU points and even surpasses a commercial translation system.
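The abstract only names the idea of drawing on image-text pairs and parallel text-only data alongside full triplets. The sketch below is a minimal, hypothetical illustration of how such mixed pools might be sampled during training; the function and variable names are our own assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the paper's code): mix three data pools so every example
# carries at least two of the three fields (image, source, target).
import random

def sample_example(triplets, image_text_pairs, text_pairs):
    """Pick one example from a randomly chosen pool; absent fields are None."""
    pool = random.choice(["triplet", "image_text", "text_only"])
    if pool == "triplet":
        image, src, tgt = random.choice(triplets)
    elif pool == "image_text":
        image, src = random.choice(image_text_pairs)
        tgt = None          # no reference translation for this pair
    else:
        src, tgt = random.choice(text_pairs)
        image = None        # no visual context for this pair
    return {"image": image, "source": src, "target": tgt}

# Toy usage with made-up data: a multimodal translation model would consume
# whichever modalities are present in each sampled example.
triplets = [("img_001.jpg", "a bat on the table", "une chauve-souris sur la table")]
image_text_pairs = [("img_002.jpg", "a mouse next to the keyboard")]
text_pairs = [("the bank of the river", "la rive du fleuve")]
print(sample_example(triplets, image_text_pairs, text_pairs))
```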
Zhu, Y., Sun, Z., Cheng, S., Huang, L., Wu, L., & Wang, M. (2023). Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 2679–2697). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.168