On Vision Features in Multimodal Machine Translation


Abstract

Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but has paid little attention to the quality of the vision models themselves. In this work, we investigate the impact of vision models on MMT. Given that Transformers are becoming popular in computer vision, we experiment with various strong models (such as Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image to MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the visual modality. Our results also suggest the need to carefully examine MMT models, especially since current benchmarks are small-scale and biased. Our code is available at https://github.com/libeineu/fairseq_mmt.
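
To make the "selective attention" idea concrete, below is a minimal, hedged PyTorch sketch of a gated cross-attention fusion layer: text encoder states query ViT patch features, and a learned gate decides how much visual context to mix into each token. The class name, dimensions, and gating form are illustrative assumptions for exposition, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class SelectiveAttention(nn.Module):
    """Illustrative sketch: text tokens attend over image patch features,
    and a sigmoid gate controls how much visual context is fused in.
    Names and shapes are assumptions, not the paper's exact code."""

    def __init__(self, text_dim: int, image_dim: int):
        super().__init__()
        # Single-head cross-attention: text states as queries,
        # image patch features as keys/values.
        self.attn = nn.MultiheadAttention(
            embed_dim=text_dim, num_heads=1,
            kdim=image_dim, vdim=image_dim, batch_first=True)
        # Gate computed from the text state and the attended image context.
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_states, patch_feats):
        # text_states: (batch, src_len, text_dim) encoder outputs
        # patch_feats: (batch, n_patches, image_dim), e.g. ViT patch features
        img_context, _ = self.attn(text_states, patch_feats, patch_feats)
        lam = torch.sigmoid(
            self.gate(torch.cat([text_states, img_context], dim=-1)))
        # Gated residual fusion of textual and visual representations.
        return text_states + lam * img_context


# Toy usage with hypothetical dimensions (512-d text, 768-d ViT patches).
fusion = SelectiveAttention(text_dim=512, image_dim=768)
text = torch.randn(2, 20, 512)       # 20 source tokens
patches = torch.randn(2, 197, 768)   # 196 ViT patches + [CLS]
out = fusion(text, patches)          # (2, 20, 512)
```

The gated residual keeps the text representation intact when the image is uninformative, which is one way to probe the patch-level contribution the abstract refers to.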

Cite

APA

Li, B., Lv, C., Zhou, Z., Zhou, T., Xiao, T., Ma, A., & Zhu, J. (2022). On Vision Features in Multimodal Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 6327–6337). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.438
