Towards understanding sample variance in visually grounded language generation: Evaluations and observations

Citations: 6
Readers (Mendeley): 80

Abstract

A major challenge in visually grounded language generation is to build robust benchmark datasets and models that generalize well in real-world settings. To do this, it is critical to ensure that our evaluation protocols are correct and our benchmarks are reliable. In this work, we design a set of experiments to understand an important but often ignored problem in visually grounded language generation: given that humans have different utilities and visual attention, how does the sample variance in multi-reference datasets affect model performance? Empirically, we study several multi-reference datasets and their corresponding vision-and-language tasks. We show that it is of paramount importance to report variance in experiments; that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task; and that, among metrics, CIDEr shows systematically larger variance than others. Our evaluations of references per instance shed light on the design of reliable datasets in the future.
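To make the idea of reference-level sample variance concrete, the sketch below shows one way to estimate how a corpus-level metric fluctuates when each instance keeps only k randomly sampled references. This is an illustrative assumption, not the paper's released code: the function name, the subsampling protocol, and the generic metric_fn wrapper (standing in for any corpus-level metric such as BLEU or CIDEr) are hypothetical.

```python
import random
import statistics
from typing import Callable, Dict, List


def reference_subsample_variance(
    hypotheses: Dict[str, str],
    references: Dict[str, List[str]],
    metric_fn: Callable[[Dict[str, str], Dict[str, List[str]]], float],
    k: int = 1,
    n_trials: int = 30,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate mean and standard deviation of a corpus-level metric
    when each instance keeps only k randomly chosen references
    (hypothetical protocol for studying reference-level variance)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        # Keep at most k references per instance for this trial.
        sampled = {
            inst_id: rng.sample(refs, min(k, len(refs)))
            for inst_id, refs in references.items()
        }
        scores.append(metric_fn(hypotheses, sampled))
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```

In use, metric_fn would wrap whatever scorer the benchmark already provides; sweeping k from 1 up to the full reference count and reporting both mean and standard deviation is one way to report the variance the abstract argues for.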

Cite (APA)

Zhu, W., Wang, X. E., Narayana, P., Sone, K., Basu, S., & Wang, W. Y. (2020). Towards understanding sample variance in visually grounded language generation: Evaluations and observations. In EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 8806–8811). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.emnlp-main.708
