Learning to Learn Words from Visual Scenes

Abstract

Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our approach is able to more rapidly acquire novel words as well as more robustly generalize to unseen compositions, significantly outperforming established baselines. A key advantage of our approach is that it is data efficient, allowing representations to be learned from scratch without language pre-training. Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples.
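The episodic training scheme the abstract describes can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function name, the `<mask>` token, and the episode layout (a support set of grounded reference pairs plus a masked query) are assumptions about how such episodes might be built from (scene, word) pairs.

```python
import random

def make_episode(pairs, n_support=4, seed=0):
    """Build one meta-learning episode from (scene, word) pairs.

    The support set contains a reference example that grounds the target
    word, plus distractor pairs; the query masks the target word, so the
    learner must infer it from the support set alone. (Illustrative
    sketch only; not the authors' actual episode sampler.)
    """
    rng = random.Random(seed)
    # Pick the word to be "acquired" in this episode.
    target_scene, target_word = rng.choice(pairs)
    # Reference examples that ground the target word (different scenes).
    grounding = [p for p in pairs if p[1] == target_word and p[0] != target_scene]
    # Distractor pairs with other words fill out the support set.
    distractors = [p for p in pairs if p[1] != target_word]
    support = grounding[:1] + rng.sample(
        distractors, min(n_support - 1, len(distractors)))
    rng.shuffle(support)
    # The query's word is masked; supervision is the held-out target word.
    query = (target_scene, "<mask>")
    return support, query, target_word
```

Sampling episodes this way exposes the learner to many "novel" words during training, which is what lets it acquire genuinely unseen words from a handful of examples at test time.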

Citation (APA)

Surís, D., Epstein, D., Ji, H., Chang, S. F., & Vondrick, C. (2020). Learning to Learn Words from Visual Scenes. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12374 LNCS, pp. 434–452). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58526-6_26
