A major challenge in visually grounded language generation is to build robust benchmark datasets and models that generalize well in real-world settings. To achieve this, it is critical to ensure that evaluation protocols are sound and benchmarks are reliable. In this work, we design a set of experiments to understand an important but often ignored problem in visually grounded language generation: given that humans have different utilities and visual attention, how does sample variance in multi-reference datasets affect models' performance? Empirically, we study several multi-reference datasets and their corresponding vision-and-language tasks. We show that it is of paramount importance to report variance in experiments; that human-generated references vary drastically across datasets and tasks, reflecting the nature of each task; and that, among metrics, CIDEr exhibits systematically larger variance than the others. Our evaluations of the number of references per instance shed light on the design of reliable datasets in the future.
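The abstract concerns how the choice and number of references in multi-reference datasets affects automatic evaluation scores. The following is a minimal, hypothetical sketch (not the authors' code) of one way to probe this: for each instance, repeatedly subsample k references from the full pool, score the same hypothesis against each subsample with a sentence-level metric, and report the spread. The captions, the helper name, and the subsample size k are illustrative assumptions; the paper's own experiments use established metrics such as CIDEr and BLEU over full datasets.

```python
# Sketch: estimate how much a sentence-level score varies with the sampled references.
# Assumes sacrebleu is installed (pip install sacrebleu); captions below are made up.
import random
import statistics

import sacrebleu


def score_spread(hypothesis, reference_pool, k=1, n_trials=100, seed=0):
    """Mean and std. dev. of sentence-level BLEU over random k-reference subsets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        refs = rng.sample(reference_pool, k)          # draw k references at random
        scores.append(sacrebleu.sentence_bleu(hypothesis, refs).score)
    return statistics.mean(scores), statistics.pstdev(scores)


# Toy example: one model output and a pool of human captions for a single image.
hyp = "a man rides a brown horse on the beach"
refs = [
    "a man is riding a horse along the shore",
    "someone rides a horse near the ocean",
    "a person on horseback at the beach",
    "a brown horse carries a rider by the sea",
    "a man rides his horse on the sand",
]
mean, std = score_spread(hyp, refs, k=1)
print(f"single-reference BLEU: mean={mean:.1f}, std={std:.1f}")
```

Repeating this with larger k (more references per instance) or with other metrics illustrates the kind of variance comparison the abstract describes.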
CITATION STYLE
Zhu, W., Wang, X. E., Narayana, P., Sone, K., Basu, S., & Wang, W. Y. (2020). Towards understanding sample variance in visually grounded language generation: Evaluations and observations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020) (pp. 8806–8811). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.708