Extracting visual knowledge from the web with multimodal learning


Abstract

We consider the problem of automatically extracting visual objects from web images. Despite the extraordinary advancement in deep learning, visual object detection remains a challenging task. To overcome the deficiency of purely visual techniques, we propose to make use of the meta text surrounding images on the Web for enhanced detection accuracy. In this paper we present a multimodal learning algorithm that integrates text information into visual knowledge extraction. To demonstrate the effectiveness of our approach, we developed a system that takes raw web pages and a small set of training images from ImageNet as inputs, and automatically extracts visual knowledge (e.g., object bounding boxes) from tens of millions of images crawled from the Web. Experimental results on 46 object categories show that extraction precision improves significantly from 73% (with state-of-the-art deep learning programs) to 81%, which is equivalent to a 31% reduction in error rate.
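The core idea described above, combining a visual detector's confidence with evidence from the text surrounding an image, can be sketched as a simple late fusion. This is a hypothetical illustration under assumed names and weights (`text_relevance`, `fused_score`, `alpha`), not the authors' actual algorithm:

```python
# Hypothetical sketch of the multimodal-fusion idea: a visual detector's
# confidence is combined with a text-relevance score computed from the
# meta text surrounding the image on its web page. All function names,
# the keyword-matching heuristic, and the weight alpha are illustrative
# assumptions, not the paper's implementation.

def text_relevance(surrounding_text, category_keywords):
    """Fraction of category keywords that appear in the image's meta text."""
    words = set(surrounding_text.lower().split())
    hits = sum(1 for kw in category_keywords if kw in words)
    return hits / len(category_keywords)

def fused_score(visual_score, text_score, alpha=0.7):
    """Convex combination of visual and textual evidence (alpha assumed)."""
    return alpha * visual_score + (1 - alpha) * text_score

# Example: a weak visual detection is boosted by strongly relevant meta text.
v = 0.55  # detector confidence for the "dog" category (illustrative)
t = text_relevance("photo of my dog and puppy", ["dog", "puppy"])  # 1.0
score = fused_score(v, t)  # 0.7 * 0.55 + 0.3 * 1.0 = 0.685
```

A detection whose visual score alone would fall below a decision threshold can thus be recovered when the surrounding text strongly supports the category, which is consistent with the precision gain the abstract reports.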

Citation (APA)

Gong, D., & Wang, D. Z. (2017). Extracting visual knowledge from the web with multimodal learning. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1718–1724). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2017/238
