The use of explicit object detectors as an intermediate step in image captioning, which constituted an essential stage in early work, is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information and can thus serve as an interpretable representation for better understanding why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that the frequency, size, and position of objects are complementary cues that all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways to image captioning.
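To make the frequency, size, and position cues concrete, the sketch below shows one plausible way to pool raw detections into a fixed-length image vector. This is an illustrative assumption, not the authors' actual feature extraction: the Detection structure, the category vocabulary, and the per-category mean pooling are all hypothetical choices.

```python
# Hypothetical sketch: building an interpretable image representation from
# explicit object detections, encoding per-category frequency (count),
# size (normalized box area), and position (normalized box center).
from collections import defaultdict
from dataclasses import dataclass
from typing import List

CATEGORIES = ["person", "dog", "car"]  # assumed detector vocabulary

@dataclass
class Detection:
    category: str
    x1: float
    y1: float
    x2: float
    y2: float  # bounding box in pixel coordinates

def image_vector(dets: List[Detection], img_w: float, img_h: float) -> List[float]:
    """Per category: [count, mean normalized area, mean center x, mean center y]."""
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0])  # count, area sum, cx sum, cy sum
    for d in dets:
        area = (d.x2 - d.x1) * (d.y2 - d.y1) / (img_w * img_h)
        cx = (d.x1 + d.x2) / (2 * img_w)
        cy = (d.y1 + d.y2) / (2 * img_h)
        s = stats[d.category]
        s[0] += 1
        s[1] += area
        s[2] += cx
        s[3] += cy
    vec = []
    for c in CATEGORIES:
        n, a, x, y = stats[c]
        # Average the size/position cues over detections of this category;
        # absent categories contribute zeros.
        vec += [float(n)] + ([a / n, x / n, y / n] if n else [0.0, 0.0, 0.0])
    return vec

# Example: two people and a dog detected in a 640x480 image.
dets = [Detection("person", 10, 50, 200, 400),
        Detection("person", 300, 60, 450, 420),
        Detection("dog", 200, 300, 320, 460)]
print(image_vector(dets, 640, 480))
```

Because each dimension corresponds to a named category and a named cue, a representation like this stays inspectable, which is what lets the analysis attribute captioning performance to specific cues and object categories.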
Wang, J., Madhyastha, P., & Specia, L. (2018). Object counts! Bringing explicit detections back into image captioning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), Vol. 1, pp. 2180–2193. Association for Computational Linguistics. https://doi.org/10.18653/v1/n18-1198