Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present. We present algorithms that discover image-sentence relationships without relying on explicit multimodal annotation in training. We experiment on seven datasets of varying difficulty, ranging from documents consisting of groups of images captioned post hoc by crowdworkers to naturally-occurring user-generated multimodal documents. We find that a structured training objective based on identifying whether collections of images and sentences co-occur in documents can suffice to predict links between specific sentences and specific images within the same document at test time.
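The abstract's central claim, that a document-level co-occurrence objective can yield sentence-to-image links without link-level supervision, can be made concrete with a short sketch. The following is a minimal illustration, assuming precomputed image and sentence embeddings; the specific scoring rule (mean over sentences of the max over images) and all function names are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a document-level contrastive objective of this kind:
# the model never sees gold image-sentence links during training; it only
# learns to score a document's true (images, sentences) pair above a
# mismatched pair. All names and the scoring rule are assumptions.
import torch
import torch.nn.functional as F

def doc_similarity(img_emb: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
    """Score a document: match each sentence to its best image, then average.
    img_emb: (n_images, d); sent_emb: (n_sents, d)."""
    sims = F.normalize(sent_emb, dim=-1) @ F.normalize(img_emb, dim=-1).T
    return sims.max(dim=1).values.mean()  # (n_sents, n_images) -> scalar

def hinge_loss(img_emb, sent_emb, neg_img_emb, margin: float = 0.2):
    """Push the true document's score above a mismatched document's score.
    neg_img_emb holds images drawn from some *other* document."""
    pos = doc_similarity(img_emb, sent_emb)
    neg = doc_similarity(neg_img_emb, sent_emb)
    return F.relu(margin - pos + neg)

@torch.no_grad()
def predict_links(img_emb, sent_emb):
    """At test time, link each sentence to its highest-scoring image
    within the same document -- the step the abstract says this
    document-level training can suffice for."""
    sims = F.normalize(sent_emb, dim=-1) @ F.normalize(img_emb, dim=-1).T
    return sims.argmax(dim=1)  # one image index per sentence

# Usage: given embeddings for one document and negatives sampled from another,
# loss = hinge_loss(img_emb, sent_emb, neg_img_emb); links = predict_links(...)
```

The design point the sketch illustrates is that the max inside the document score gives the model an incentive to align individual sentences with individual images, even though the loss is computed only from whole-document co-occurrence.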
CITATION
Hessel, J., Lee, L., & Mimno, D. (2019). Unsupervised discovery of multimodal links in multi-image, multi-sentence documents. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 2034–2045). Association for Computational Linguistics. https://doi.org/10.18653/v1/d19-1210