We present an empirical analysis of state-of-the-art systems for referring expression recognition, the task of identifying the object in an image referred to by a natural language expression, with the goal of gaining insight into how these systems reason about language and vision. Surprisingly, we find strong evidence that even sophisticated and linguistically motivated models for this task may ignore linguistic structure, instead relying on shallow correlations introduced by unintended biases in the data selection and annotation process. For example, we show that a system trained and tested on the input image alone, without the referring expression, can achieve a precision of 71.2% in top-2 predictions. Furthermore, a system that predicts only the object category given the input can achieve a precision of 84.2% in top-2 predictions. These surprisingly positive results for what should be deficient prediction scenarios suggest that careful analysis of what our models are learning, and of how our data is constructed, is critical as we seek to make substantive progress on grounded language tasks.
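To make the reported numbers concrete, the following is a minimal sketch in Python (not the authors' implementation) of how such a deficient baseline could be evaluated: candidate objects are ranked by a model that never reads the referring expression, and an example counts as correct if the gold object appears among the top k=2 candidates. The function and data names here are hypothetical.

from typing import Sequence

def top_k_precision(scores_per_example: Sequence[Sequence[float]],
                    gold_indices: Sequence[int],
                    k: int = 2) -> float:
    """Fraction of examples whose gold object is among the k highest-scoring candidates."""
    hits = 0
    for scores, gold in zip(scores_per_example, gold_indices):
        # Rank candidate object indices by score, highest first.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_indices)

# Toy usage: two images with four candidate objects each. An "image-only"
# baseline would produce these scores without ever seeing the expression.
scores = [[0.1, 0.7, 0.6, 0.2],    # gold object ranked 2nd -> top-2 hit
          [0.9, 0.05, 0.03, 0.4]]  # gold object ranked 1st -> top-2 hit
gold = [2, 0]
print(top_k_precision(scores, gold, k=2))  # prints 1.0 on this toy data

A strong score from such a language-blind baseline is evidence of exactly the dataset bias the paper describes.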
Citation:
Cirik, V., Morency, L.-P., & Berg-Kirkpatrick, T. (2018). Visual referring expression recognition: What do systems actually learn? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 781–787). Association for Computational Linguistics. https://doi.org/10.18653/v1/n18-2123