Integrating Temporal and Spatial Attention for Video Action Recognition

Abstract

In recent years, deep convolutional neural networks (DCNNs) have been widely applied to video action recognition, and attention mechanisms are increasingly used in these tasks. In this paper, we combine temporal and spatial attention for better video action recognition. Specifically, we learn a set of sparse attention maps by computing class response maps to locate the most informative region in each video frame. Each frame is then resampled using this information to form two new frames: one focusing on the most discriminative regions of the image and the other on the complementary regions. After the sparse attention is computed, the newly generated frames are rearranged in the order of the original video to form two new videos. These two videos are fed into a CNN as additional inputs to reinforce the learning of discriminative regions (spatial attention). The CNN we use employs a frame selection strategy that allows the network to focus on only a subset of frames to complete the classification task (temporal attention). Finally, we combine the classification results of the three videos (original, discriminative, and complementary) to obtain the final prediction. Experiments on the UCF101 and HMDB51 datasets show that our approach outperforms state-of-the-art methods.
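
The pipeline described above can be summarized in a short sketch. The snippet below is a minimal illustration only, assuming a CAM-style class response map, a fixed threshold for splitting each frame into discriminative and complementary views, and a simple weighted late fusion of the three per-video scores; all function names, thresholds, and weights are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the approach in the abstract (assumed details, not the
# authors' code): derive a class response map per frame, split the frame into
# a discriminative and a complementary view, and fuse the three video-level
# classification scores.
import torch
import torch.nn.functional as F

def class_response_map(features, fc_weight, class_idx):
    # CAM-style response: weight the last conv features (C, H, W) by the
    # classifier weights (num_classes, C) of the chosen class.
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)
    return cam  # (H, W), values in [0, 1]

def split_frame(frame, cam, thresh=0.5):
    # Resample a frame (3, H, W) into a discriminative view (high-response
    # regions) and a complementary view (everything else).
    mask = (cam > thresh).float()
    mask = F.interpolate(mask[None, None], size=frame.shape[-2:], mode="nearest")[0]
    return frame * mask, frame * (1.0 - mask)

def fuse_scores(score_orig, score_disc, score_comp, weights=(1.0, 1.0, 1.0)):
    # Late fusion of the per-video scores from the original, discriminative,
    # and complementary streams (equal weights assumed here).
    w = torch.tensor(weights) / sum(weights)
    return w[0] * score_orig + w[1] * score_disc + w[2] * score_comp

In this sketch the three streams are combined only at the score level, which mirrors the late fusion of the original, discriminative, and complementary videos described in the abstract; the frame selection strategy (temporal attention) would sit inside the backbone and is not shown here.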

Citation (APA)

Zhou, Y., Li, B., Wang, Z., & Li, H. (2022). Integrating Temporal and Spatial Attention for Video Action Recognition. Security and Communication Networks, 2022. https://doi.org/10.1155/2022/5094801
