A Case Study of the Shortcut Effects in Visual Commonsense Reasoning

Abstract

Visual reasoning and question answering have attracted significant attention in recent years. Many datasets and evaluation protocols have been proposed; some have been shown to contain biases that allow models to “cheat” without performing true, generalizable reasoning. A well-known bias is reliance on language priors (the frequency of answers), which lets a model answer without looking at the image. We discover a new type of bias in the Visual Commonsense Reasoning (VCR) dataset. In particular, we show that most state-of-the-art models exploit text that co-occurs between the input (question) and the output (answer options), and rely on only a few pieces of information in the candidate options to make a decision. Unfortunately, relying on such superficial evidence makes models very fragile. To measure this fragility, we propose two ways of modifying the validation data in which a few words in the answer choices are changed without a significant change in meaning. We find that these minor modifications cause models' performance to degrade significantly. To resolve the issue, we propose a curriculum-based masking approach as a mechanism for more robust training. Our method improves the baseline by requiring it to attend to the answers as a whole, and is more effective than prior masking strategies.
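The abstract only sketches the idea of curriculum-based masking, so the following is a minimal, hypothetical illustration rather than the authors' implementation: answer tokens are masked during training, preferring words that also appear in the question (the co-occurrence shortcut described above), with the masking ratio growing over epochs. All names (MASK_TOKEN, curriculum_mask_ratio, mask_answer) are assumptions for illustration only.

```python
import random

MASK_TOKEN = "[MASK]"

def curriculum_mask_ratio(epoch, total_epochs, max_ratio=0.5):
    """Linearly increase the fraction of masked answer tokens over training (assumed schedule)."""
    return max_ratio * min(1.0, epoch / max(1, total_epochs - 1))

def mask_answer(question_tokens, answer_tokens, ratio, rng=random):
    """Mask answer tokens, preferring those that co-occur with the question."""
    n_mask = int(round(ratio * len(answer_tokens)))
    if n_mask == 0:
        return list(answer_tokens)
    question_vocab = {q.lower() for q in question_tokens}
    # Indices of answer tokens that also appear in the question (the shortcut words).
    overlap = [i for i, t in enumerate(answer_tokens) if t.lower() in question_vocab]
    others = [i for i in range(len(answer_tokens)) if i not in set(overlap)]
    rng.shuffle(overlap)
    rng.shuffle(others)
    chosen = set((overlap + others)[:n_mask])
    return [MASK_TOKEN if i in chosen else t for i, t in enumerate(answer_tokens)]

# Example: midway through training, roughly a quarter of each answer option is masked.
q = "Why is person1 holding an umbrella ?".split()
a = "Because person1 is holding an umbrella to stay dry .".split()
print(mask_answer(q, a, curriculum_mask_ratio(epoch=5, total_epochs=10)))
```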

Citation (APA)

Ye, K., & Kovashka, A. (2021). A Case Study of the Shortcut Effects in Visual Commonsense Reasoning. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021 (Vol. 4B, pp. 3181–3189). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v35i4.16428
