Visual reasoning and question answering have attracted attention in recent years. Many datasets and evaluation protocols have been proposed; some have been shown to contain biases that allow models to “cheat” without performing true, generalizable reasoning. A well-known bias is dependence on language priors (the frequency of answers), which results in the model not looking at the image at all. We discover a new type of bias in the Visual Commonsense Reasoning (VCR) dataset. In particular, we show that most state-of-the-art models exploit text that co-occurs between the input (question) and the output (answer options), and rely on only a few pieces of information in the candidate options to make a decision. Unfortunately, relying on such superficial evidence makes models very fragile. To measure this fragility, we propose two ways to modify the validation data, in which a few words in the answer choices are changed without significant changes in meaning. We find that such minor changes cause models' performance to degrade significantly. To resolve this issue, we propose a curriculum-based masking approach as a mechanism for more robust training. Our method improves the baseline by requiring it to attend to the answers as a whole, and is more effective than prior masking strategies.
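The abstract does not spell out the masking details, but the core idea of curriculum-based masking can be illustrated with a minimal sketch: gradually increase the fraction of answer-option tokens that are masked as training progresses, so the model cannot latch onto a handful of predictive words. The linear schedule, maximum masking rate, and token-level random masking below are illustrative assumptions, not the paper's exact procedure.

```python
import random

MASK_TOKEN = "[MASK]"

def masking_rate(epoch, num_epochs, max_rate=0.3):
    """Curriculum schedule (assumed): start with little masking and
    increase the rate linearly up to max_rate over training."""
    return max_rate * min(1.0, epoch / max(1, num_epochs - 1))

def mask_answer_tokens(answer_tokens, rate, rng=random):
    """Randomly replace a fraction of the answer tokens with a mask token,
    encouraging the model to use the remaining answer context rather than
    a few highly predictive words."""
    return [MASK_TOKEN if rng.random() < rate else tok for tok in answer_tokens]

# Toy usage: the masking rate grows with the epoch index.
answer = "She is pointing at the menu because she wants to order".split()
for epoch in range(3):
    rate = masking_rate(epoch, num_epochs=3)
    print(epoch, round(rate, 2), " ".join(mask_answer_tokens(answer, rate)))
```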
CITATION STYLE
Ye, K., & Kovashka, A. (2021). A Case Study of the Shortcut Effects in Visual Commonsense Reasoning. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021 (Vol. 4B, pp. 3181–3189). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v35i4.16428