In this paper, we focus on solving a new task called TALL (Temporal Activity Localization via Language Query). Its goal is to localize actions in long, untrimmed videos using natural language queries. We propose a new model called VAL (Visual-attention Action Localizer) to address it. Specifically, VAL applies voxel-wise attention and channel-wise attention to the feature maps of the last convolutional layer. These two visual attention mechanisms are designed to match the characteristics of the feature maps: they enhance the visual representations and improve the extraction of cross-modal correlations. Experimental results on both the TACoS and Charades-STA datasets demonstrate the effectiveness of our model.
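The abstract does not spell out the exact attention formulation, so the sketch below is illustrative only: a minimal PyTorch rendering of query-guided voxel-wise (spatial) attention followed by a channel-wise gate over a conv feature map. All module names, tensor shapes, layer sizes, and the additive/sigmoid score functions (QueryGuidedAttention, hidden, etc.) are assumptions for illustration, not the authors' implementation.

# Minimal sketch, assuming a (B, C, H, W) conv feature map and a
# (B, Q) sentence-query embedding; the exact score functions used in
# the paper may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryGuidedAttention(nn.Module):
    """Applies voxel-wise (spatial) then channel-wise attention to a
    last-conv-layer feature map, conditioned on a language query."""

    def __init__(self, channels: int, query_dim: int, hidden: int = 256):
        super().__init__()
        # Projections for the voxel-wise (spatial) attention scores.
        self.vox_visual = nn.Conv2d(channels, hidden, kernel_size=1)
        self.vox_query = nn.Linear(query_dim, hidden)
        self.vox_score = nn.Conv2d(hidden, 1, kernel_size=1)
        # Projection for the channel-wise attention gate.
        self.chn_query = nn.Linear(query_dim, channels)

    def forward(self, feat: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # feat:  (B, C, H, W) last conv-layer feature map
        # query: (B, Q)       encoded language query
        B, C, H, W = feat.shape

        # Voxel-wise attention: one weight per spatial location,
        # normalized with a softmax over all H*W positions.
        q = self.vox_query(query).unsqueeze(-1).unsqueeze(-1)           # (B, hidden, 1, 1)
        scores = self.vox_score(torch.tanh(self.vox_visual(feat) + q))  # (B, 1, H, W)
        spatial = F.softmax(scores.view(B, -1), dim=1).view(B, 1, H, W)
        feat = feat * spatial * (H * W)  # rescale so feature magnitudes stay comparable

        # Channel-wise attention: one sigmoid gate per feature channel,
        # combining the query with globally pooled visual statistics.
        pooled = feat.mean(dim=(2, 3))                        # (B, C)
        gate = torch.sigmoid(self.chn_query(query) + pooled)  # (B, C)
        return feat * gate.unsqueeze(-1).unsqueeze(-1)

The attended feature map would then feed the cross-modal correlation stage; under these assumptions, the spatial softmax highlights query-relevant locations while the channel gate reweights query-relevant feature detectors.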
Song, X., & Han, Y. (2018). VAL: Visual-attention action localizer. In Lecture Notes in Computer Science (Vol. 11165, pp. 340–350). Springer. https://doi.org/10.1007/978-3-030-00767-6_32