We focus on the Embodied Question Answering (EQA) task, its dataset, and its models (Das et al., 2018). In particular, we examine the effects of visual perturbation at different levels by providing the model with incongruent, black, or random-noise images. We observe that the model is still able to learn from general visual patterns, suggesting that it captures some common-sense reasoning about the visual world. We argue that improved data and models are required to predict (generate) correct answers more reliably. The code is available here: https://github.com/GU-CLASP/embodied-qa.
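To make the perturbation settings concrete, below is a minimal sketch (not taken from the released code) of how black and random-noise frames could be substituted for the agent's visual input; the function name perturb_frame, the tensor layout, and the use of PyTorch are assumptions for illustration only.

```python
import torch

def perturb_frame(frame: torch.Tensor, mode: str = "none") -> torch.Tensor:
    """Return a perturbed copy of an RGB frame of shape (C, H, W), values in [0, 1].

    Hypothetical helper: 'black' zeroes out the frame and 'noise' replaces it
    with uniform random noise; incongruent frames would instead be drawn from
    an unrelated episode (not shown here).
    """
    if mode == "black":
        return torch.zeros_like(frame)   # all-black image
    if mode == "noise":
        return torch.rand_like(frame)    # uniform random noise
    return frame                         # unperturbed baseline
```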
Ilinykh, N., Emampoor, Y., & Dobnik, S. (2022). Look and Answer the Question: On the Role of Vision in Embodied Question Answering. In 15th International Natural Language Generation Conference, INLG 2022 (pp. 236–245). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.inlg-main.19