Transformer-Exclusive Cross-Modal Representation for Vision and Language

Abstract

Since the advent of deep learning, cross-modal representation learning has been dominated by approaches that pair convolutional neural networks for visual representation with recurrent neural networks for language representation. The transformer architecture, however, has rapidly displaced recurrent neural networks in natural language processing, and it has also been shown that vision tasks can be handled by transformers with performance comparable to convolutional neural networks. These results naturally raise the question of whether cross-modal representation for vision and language can be tackled exclusively with transformers. This paper examines such transformer-exclusive cross-modal representation, demonstrating its potential as well as discussing its current limitations and prospects.
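To make the abstract's idea concrete, below is a minimal, hypothetical sketch of what a transformer-exclusive cross-modal model can look like: image patches are linearly projected (ViT-style), text tokens are embedded, and a single transformer encoder attends over the joint sequence. The class name, dimensions, and layout are illustrative assumptions, not the authors' actual architecture from the paper.

```python
# Hypothetical sketch, NOT the paper's model: one transformer encoder over
# image patch embeddings and text token embeddings concatenated as a sequence.
import torch
import torch.nn as nn

class CrossModalTransformer(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, patch_size=16,
                 img_size=224, n_layers=4, n_heads=8, max_text_len=32):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # Vision path: non-overlapping patch projection, as in ViT.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        # Language path: ordinary token embeddings.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Learned positions for the joint [CLS] + patches + tokens sequence.
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + n_patches + max_text_len, d_model))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image, text_ids):
        b = image.size(0)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, D)
        tokens = self.tok_embed(text_ids)                             # (B, T, D)
        cls = self.cls_token.expand(b, -1, -1)
        seq = torch.cat([cls, patches, tokens], dim=1)
        seq = seq + self.pos_embed[:, : seq.size(1)]
        out = self.encoder(seq)
        return out[:, 0]  # joint [CLS] representation of image + text

# Usage with dummy inputs.
model = CrossModalTransformer()
img = torch.randn(2, 3, 224, 224)
txt = torch.randint(0, 30522, (2, 32))
print(model(img, txt).shape)  # torch.Size([2, 256])
```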

Citation (APA)
Shin, A., & Narihira, T. (2021). Transformer-Exclusive Cross-Modal Representation for Vision and Language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 2719–2725). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.240
