Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog


Abstract

Visual dialog requires models to give reasonable answers to a series of coherent questions grounded in the visual concepts of an image. However, most current work focuses either on attention-based fusion or on pre-training with large-scale image-text pairs, ignoring the critical role of explicit vision-language alignment in visual dialog. To remedy this defect, we propose a novel unsupervised and pseudo-supervised vision-language alignment approach for visual dialog (AlignVD). First, AlignVD uses visual and dialog encoders to represent images and dialogs. It then explicitly aligns visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment (UVLA and PVLA): UVLA performs alignment with a graph autoencoder, while PVLA uses dialog-guided visual grounding. Finally, based on the aligned visual and textual representations, AlignVD answers the question via a cross-modal decoder. Extensive experiments on two large-scale visual dialog datasets demonstrate the effectiveness of vision-language alignment, and our proposed AlignVD achieves new state-of-the-art results. In addition, our single model won first place on the visual dialog challenge leaderboard with an NDCG of 78.70, surpassing the previous best ensemble model by about 1 point.
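To make the pipeline concrete, below is a minimal sketch in PyTorch of the encode-align-decode flow the abstract describes. Every module name, dimension, and loss form here is an illustrative assumption (the paper's actual UVLA/PVLA formulations are not given in this abstract); it only shows how an unsupervised graph-autoencoder reconstruction loss and a pseudo-supervised grounding loss could both act on the same vision-language affinity before a cross-modal decoder produces the answer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignVDSketch(nn.Module):
    """Hypothetical stand-in for AlignVD, not the authors' implementation."""
    def __init__(self, dim=512, vocab=30522):
        super().__init__()
        self.visual_enc = nn.Linear(2048, dim)       # region features -> shared space
        self.dialog_enc = nn.Embedding(vocab, dim)   # stand-in for a transformer dialog encoder
        # UVLA: graph autoencoder over the region-token affinity graph (assumed form)
        self.gae_enc = nn.Linear(dim, dim)
        self.gae_dec = nn.Linear(dim, dim)
        # cross-modal decoder (assumed single cross-attention layer)
        self.decoder = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.answer_head = nn.Linear(dim, vocab)

    def forward(self, regions, tokens, pseudo_region=None):
        v = self.visual_enc(regions)                     # (B, R, D) image regions
        t = self.dialog_enc(tokens)                      # (B, T, D) dialog tokens
        # vision-language affinity graph between regions and tokens
        adj = torch.softmax(v @ t.transpose(1, 2), -1)   # (B, R, T)
        # UVLA: encode text-aggregated regions, then reconstruct the graph
        z = self.gae_enc(adj @ t)                        # (B, R, D)
        recon = torch.softmax(self.gae_dec(z) @ t.transpose(1, 2), -1)
        uvla_loss = F.mse_loss(recon, adj)               # unsupervised reconstruction
        # PVLA: pseudo labels (index of the grounded region, assumed to come
        # from an external dialog-guided grounder) supervise the same affinity
        pvla_loss = (F.cross_entropy(adj.max(-1).values, pseudo_region)
                     if pseudo_region is not None else adj.new_zeros(()))
        # decode an answer from the aligned representations
        out, _ = self.decoder(t, z, z)                   # (B, T, D)
        return self.answer_head(out), uvla_loss + pvla_loss

In this sketch the two alignment signals are simply summed into one auxiliary loss next to the answer-generation objective; how the real model weights UVLA against PVLA, and what grounder produces the pseudo labels, would have to come from the full paper.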

Cite

APA

Chen, F., Zhang, D., Chen, X., Shi, J., Xu, S., & Xu, B. (2022). Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia (pp. 4142–4153). Association for Computing Machinery, Inc. https://doi.org/10.1145/3503161.3547776
