Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Abstract

Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT, and UNITER have significantly lifted the state of the art across a wide range of V+L benchmarks. However, little is known about the inner mechanisms that drive their impressive success. To reveal the secrets behind the scene, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection) generalizable to standard pre-trained V+L models, to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending to text rather than images during inference. (ii) A subset of attention heads is tailored to capturing cross-modal interactions. (iii) The learned attention matrices in pre-trained models exhibit patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These insights can guide future work on designing better model architectures and objectives for multimodal pre-training. (Code is available at https://github.com/JizeCao/VALUE.)
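
To illustrate the flavor of probing behind observation (i), the sketch below (not the authors' code) shows one way to measure how much attention mass text tokens place on image-region tokens in a single-stream V+L model, where text tokens and region features share one input sequence. The tensor shapes, the layer/head counts, and the text/image split are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a cross-modal attention probe, assuming a UNITER-style
# single-stream model whose attention tensors are available as
# [num_layers, num_heads, seq_len, seq_len] probabilities (rows sum to 1),
# with the first `num_text_tokens` positions holding text and the rest
# holding image regions. All concrete numbers below are hypothetical.

import torch


def cross_modal_attention_share(attn: torch.Tensor, num_text_tokens: int) -> torch.Tensor:
    """Return, per layer and head, the average attention mass that
    text-token queries assign to image-region keys."""
    text_queries = attn[:, :, :num_text_tokens, :]        # rows = text tokens
    to_image = text_queries[:, :, :, num_text_tokens:]    # columns = image regions
    # Sum the probability mass over image-region keys, then average over text queries.
    return to_image.sum(dim=-1).mean(dim=-1)              # [num_layers, num_heads]


if __name__ == "__main__":
    # Toy setup: 12 layers, 12 heads, 20 text tokens + 36 image regions.
    layers, heads, n_txt, n_img = 12, 12, 20, 36
    seq_len = n_txt + n_img
    logits = torch.randn(layers, heads, seq_len, seq_len)
    attn = torch.softmax(logits, dim=-1)

    share = cross_modal_attention_share(attn, n_txt)
    print(share.shape)   # torch.Size([12, 12])
    # For random logits this hovers near n_img / seq_len; heads specialized
    # for cross-modal interaction would deviate markedly from that baseline.
    print(share.mean())
```

Heads whose text-to-image share is unusually high (or low) relative to the uniform baseline are the kind of candidates the paper's probing tasks single out as capturing cross-modal interactions.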

Citation (APA)

Cao, J., Gan, Z., Cheng, Y., Yu, L., Chen, Y. C., & Liu, J. (2020). Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12351 LNCS, pp. 565–580). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58539-6_34
