Learning to contrast the counterfactual samples for robust visual question answering

Abstract

In Visual Question Answering (VQA), most state-of-the-art models tend to learn spurious correlations in the training set and perform poorly on out-of-distribution test data. Several methods for generating counterfactual samples have been proposed to alleviate this problem. However, most previous methods simply add the generated counterfactual samples to the training data for augmentation and do not fully exploit them. We therefore introduce a novel self-supervised contrastive learning mechanism that learns the relationship between original samples, factual samples, and counterfactual samples. With the better cross-modal joint embeddings learned from this auxiliary training objective, the reasoning capability and robustness of the VQA model are boosted significantly. We demonstrate the effectiveness of our method by surpassing current state-of-the-art models on the VQA-CP dataset, a diagnostic benchmark for assessing the robustness of VQA models.
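
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common form such a loss could take, not the authors' implementation. It assumes the original sample's joint embedding acts as the anchor, the factual sample's embedding as the positive, and the counterfactual sample's embedding as the negative, with cosine similarity as the score. The function names, the `temperature` parameter, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negative, temperature=1.0):
    """Assumed loss: -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))).

    anchor   : joint embedding of the original sample (assumption)
    positive : joint embedding of the factual sample (assumption)
    negative : joint embedding of the counterfactual sample (assumption)
    """
    s_pos = cosine(anchor, positive) / temperature
    s_neg = cosine(anchor, negative) / temperature
    # Algebraically equal to the softmax form above: log(1 + exp(s_neg - s_pos)).
    return math.log1p(math.exp(s_neg - s_pos))


# Toy 2-D embeddings, purely illustrative.
original = [1.0, 0.0]
factual = [0.9, 0.1]          # close to the original
counterfactual = [0.0, 1.0]   # orthogonal to the original

well_separated = contrastive_loss(original, factual, counterfactual)
confused = contrastive_loss(original, counterfactual, factual)
print(well_separated < confused)
```

Under these assumptions, the loss is small when the original embedding sits close to the factual one and far from the counterfactual one, and it grows when the roles are confused. In a real model such a term would be added to the usual VQA answer loss as an auxiliary objective, with a weighting that this sketch does not attempt to specify.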

Cite

Liang, Z., Jiang, W., Hu, H., & Zhu, J. (2020). Learning to contrast the counterfactual samples for robust visual question answering. In EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 3285–3292). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.emnlp-main.265
