Learning image embeddings using convolutional neural networks for improved multi-modal semantics

Douwe Kiela; Léon Bottou

Conference Proceedings

EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (2014) 36-45

DOI: 10.3115/v1/d14-1005

181 Citations

277 Readers

Abstract

We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network (CNN) trained on a large labeled object recognition dataset. This transfer learning approach brings a clear performance gain over features based on the traditional bag-of-visual-word approach. Experimental results are reported on the WordSim353 and MEN semantic relatedness evaluation tasks. We use visual features computed using either ImageNet or ESP Game images.
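The concatenation step described in the abstract can be illustrated with a minimal sketch. Only the concatenation of a linguistic vector with a visual vector comes from the abstract; the rest is assumed for illustration: the function names are hypothetical, the per-image CNN features are averaged into one visual vector, each modality is L2-normalized before concatenation, and relatedness is scored with cosine similarity. The toy vectors are made up and are not data from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(image_features):
    """Average per-image feature vectors into one visual concept vector.

    Averaging is an assumption made for this sketch.
    """
    n = len(image_features)
    return [sum(column) / n for column in zip(*image_features)]


def multimodal_vector(linguistic_vec, image_features):
    """Concatenate the normalized linguistic and visual vectors."""
    visual_vec = mean_pool(image_features)
    return l2_normalize(linguistic_vec) + l2_normalize(visual_vec)


def cosine(u, v):
    """Cosine similarity, used here as an assumed relatedness score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy inputs: a 3-d "skip-gram" vector and two 2-d "CNN feature" vectors
# per concept. Real vectors would have hundreds or thousands of dimensions.
cat = multimodal_vector([0.9, 0.1, 0.0], [[0.8, 0.2], [0.6, 0.4]])
dog = multimodal_vector([0.8, 0.2, 0.1], [[0.7, 0.3], [0.5, 0.5]])
car = multimodal_vector([0.0, 0.2, 0.9], [[0.1, 0.9], [0.3, 0.7]])

print("cat vs dog:", cosine(cat, dog))
print("cat vs car:", cosine(cat, car))
```

Normalizing each modality before concatenation keeps either one from dominating the similarity score simply because its raw values are larger; this is a design choice of the sketch, not a detail stated in the abstract.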

Cite

Kiela, D., & Bottou, L. (2014). Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 36–45). Association for Computational Linguistics (ACL). https://doi.org/10.3115/v1/d14-1005
