Adaptive Feature Abstraction

  • Gan Z
  • Carin L
  • Min M
Citations: N/A · Mendeley readers: 10

Abstract

A new model for video captioning is developed, using a deep three-dimensional Convolutional Neural Network (C3D) as an encoder for videos and a Recurrent Neural Network (RNN) as a decoder for captions. A novel attention mechanism with spatiotemporal alignment is employed to adaptively and sequentially focus on different layers of CNN features (levels of feature "abstraction"), as well as on local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on the YouTube2Text benchmark. Experimental results quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos into sentences with rich semantic structure.
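The abstract's core idea is attention over multiple levels of feature abstraction: at each decoding step, the RNN decoder weighs the feature maps of different CNN layers and forms a single context vector. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the bilinear scoring matrix `W` and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend_over_layers(layer_feats, query, W):
    """Attention over CNN layers (levels of feature abstraction).

    layer_feats: list of L feature vectors, one per CNN layer,
                 each already projected to a common dimension d
    query:       decoder RNN hidden state, shape (d,)
    W:           bilinear scoring matrix, shape (d, d) -- a
                 hypothetical parameterization of the score function

    Returns the attention weights over layers and the attended
    context vector fed to the caption decoder.
    """
    # Score each abstraction level against the decoder state.
    scores = np.array([f @ W @ query for f in layer_feats])
    # Normalize scores into attention weights.
    alpha = softmax(scores)
    # Context = attention-weighted mix of layer features.
    context = sum(a * f for a, f in zip(alpha, layer_feats))
    return alpha, context

# Toy usage with random features standing in for C3D layer outputs.
rng = np.random.default_rng(0)
d = 4
feats = [rng.standard_normal(d) for _ in range(3)]
query = rng.standard_normal(d)
alpha, context = attend_over_layers(feats, query, np.eye(d))
```

In the paper's setting the same mechanism is applied a second time within each layer, over local spatiotemporal regions of the feature maps; the sketch above shows only the across-layer step.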

Citation (APA)

Gan, Z., Carin, L., & Min, M. R. (2017). Adaptive Feature Abstraction. ICLR, (2014), 1–4.
