Learning to relate from captions and bounding boxes

Sarthak Garg; Joel Ruben Antony Moniz; Anshu Aviral; Priyatham Bollimpalli

Conference Proceedings (Open Access)

ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (2020) 6597-6603

DOI: 10.18653/v1/p19-1660

1 Citation

112 Readers

Abstract

In this work, we propose a novel approach that predicts the relationships between entities in an image in a weakly supervised manner, relying on image captions and object bounding box annotations as the sole source of supervision. Our approach uses a top-down attention mechanism to align entities in captions to objects in the image, and then leverages the syntactic structure of the captions to align the relations. We use these alignments to train a relation classification network, thereby obtaining both grounded captions and dense relationships. We demonstrate the effectiveness of our model on the Visual Genome dataset by achieving a recall@50 of 15% and a recall@100 of 25% on the relationships present in the image. We also show that the model successfully predicts relations that are not present in the corresponding captions.
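The recall@K figures quoted in the abstract can be read as the share of annotated relationships that appear among a model's K highest-scoring predictions for an image. The following is a minimal sketch of that idea, not the paper's evaluation code: the function name `recall_at_k`, the triple representation, and the toy data are all assumptions made for illustration, and the sketch matches triples by label only, ignoring any bounding box overlap criterion a full evaluation might apply.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a relationship is a (subject, predicate, object) triple.
Triple = Tuple[str, str, str]


def recall_at_k(ranked_predictions: List[Triple],
                ground_truth: Set[Triple],
                k: int) -> float:
    """Fraction of ground-truth triples found among the k top-ranked predictions.

    `ranked_predictions` is assumed to be sorted by descending model confidence.
    Matching is by exact label equality only (an assumption of this sketch).
    """
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    return len(top_k & ground_truth) / len(ground_truth)


# Toy data, invented for illustration only.
predictions = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
    ("hat", "on", "horse"),
]
truth = {
    ("man", "riding", "horse"),
    ("horse", "on", "grass"),
    ("man", "holding", "rope"),
}

print(recall_at_k(predictions, truth, k=2))  # one of three annotated triples recovered
print(recall_at_k(predictions, truth, k=4))  # two of three annotated triples recovered
```

Under this reading, raising K can only keep recall the same or increase it, which is consistent with the recall@100 figure being higher than the recall@50 figure.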

Cite (APA)

Garg, S., Moniz, J. R. A., Aviral, A., & Bollimpalli, P. (2020). Learning to relate from captions and bounding boxes. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 6597–6603). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p19-1660
