Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos

Abstract

A major challenge in computer vision is scaling activity understanding to the long tail of complex activities without requiring the collection of large quantities of data for new actions. The task of video retrieval using natural language descriptions seeks to address this through rich, unconstrained supervision about complex activities. However, while this formulation offers hope of leveraging the underlying compositional structure of activity descriptions, existing approaches typically do not explicitly model compositional reasoning. In this work, we introduce an approach for explicitly and dynamically reasoning about compositional natural language descriptions of activity in videos. We take a modular neural network approach that, given a natural language query, extracts the semantic structure to assemble a compositional neural network layout and corresponding network modules. We show that this approach achieves state-of-the-art results on the DiDeMo video retrieval dataset.
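To make the compositional mechanism concrete, below is a minimal PyTorch sketch of how a query parse tree could dynamically assemble neural modules to score video segments. The names (LeafModule, CombineModule, evaluate_tree), the binary-tree query representation, and the dimensions are illustrative assumptions for exposition, not the paper's actual Temporal Modular Networks implementation.

import torch
import torch.nn as nn

class LeafModule(nn.Module):
    """Grounds a single word embedding in per-segment video features."""
    def __init__(self, video_dim, word_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(video_dim + word_dim, hidden_dim)

    def forward(self, video_feats, word_emb):
        # video_feats: (T, video_dim); word_emb: (word_dim,)
        word = word_emb.expand(video_feats.size(0), -1)
        return torch.relu(self.proj(torch.cat([video_feats, word], dim=1)))

class CombineModule(nn.Module):
    """Merges the representations of two child nodes of the parse tree."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.merge = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, left, right):
        return torch.relu(self.merge(torch.cat([left, right], dim=1)))

def evaluate_tree(node, video_feats, leaf, combine):
    """Recursively assemble and evaluate modules along the query parse tree.

    A leaf node is a word-embedding tensor; an internal node is a
    (left_subtree, right_subtree) pair.
    """
    if isinstance(node, torch.Tensor):
        return leaf(video_feats, node)
    left = evaluate_tree(node[0], video_feats, leaf, combine)
    right = evaluate_tree(node[1], video_feats, leaf, combine)
    return combine(left, right)

# Usage: score T candidate temporal segments against a parsed query.
T, DV, DW, H = 10, 512, 300, 256
video = torch.randn(T, DV)                    # per-segment visual features
query_tree = (torch.randn(DW), (torch.randn(DW), torch.randn(DW)))
leaf, combine = LeafModule(DV, DW, H), CombineModule(H)
scorer = nn.Linear(H, 1)                      # per-segment relevance head
scores = scorer(evaluate_tree(query_tree, video, leaf, combine))  # (T, 1)

In this sketch the network topology is determined per query: the same two module types are reused, but their wiring follows the structure of the natural language description, which is the essence of the dynamic compositional layout the abstract describes.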

Cite (APA)

Liu, B., Yeung, S., Chou, E., Huang, D. A., Fei-Fei, L., & Niebles, J. C. (2018). Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11207 LNCS, pp. 569–586). Springer Verlag. https://doi.org/10.1007/978-3-030-01219-9_34
