Learning models for actions and person-object interactions with transfer to question answering

Abstract

This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.
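To make the two training ideas named in the abstract concrete, below is a minimal sketch (not the authors' released code) of multiple-instance pooling over per-person features together with a class-weighted loss for unbalanced multi-label activity data. All names, feature dimensions, and label frequencies here are illustrative assumptions.

# Minimal sketch, assuming per-person CNN features are already extracted.
# MIL: score each detected person, then max-pool scores over instances so only
# image-level activity labels are needed. Weighted loss: up-weight rare labels.
import torch
import torch.nn as nn

class MILActivityHead(nn.Module):
    """Scores each person-box feature, then max-pools over instances per image."""
    def __init__(self, feat_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_labels)

    def forward(self, box_feats: torch.Tensor) -> torch.Tensor:
        # box_feats: (num_boxes, feat_dim) features for the people in one image
        logits = self.classifier(box_feats)      # (num_boxes, num_labels)
        image_logits, _ = logits.max(dim=0)      # MIL: best-scoring instance per label
        return image_logits                      # (num_labels,)

# Class-weighted binary cross-entropy: rarer labels receive larger positive weight.
label_freq = torch.tensor([0.30, 0.02, 0.10])    # assumed label frequencies
pos_weight = (1.0 - label_freq) / label_freq
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

head = MILActivityHead(feat_dim=2048, num_labels=3)
feats = torch.randn(5, 2048)                     # e.g. 5 detected people in an image
target = torch.tensor([1.0, 0.0, 1.0])           # image-level activity labels
loss = loss_fn(head(feats), target)
loss.backward()

Max-pooling over instances is one common MIL choice under these assumptions; the same head could use other pooling functions without changing the image-level supervision.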

Citation (APA)

Mallya, A., & Lazebnik, S. (2016). Learning models for actions and person-object interactions with transfer to question answering. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9905 LNCS, pp. 414–428). Springer Verlag. https://doi.org/10.1007/978-3-319-46448-0_25
