In deep learning-based video action recognition, the role of the neural network is to capture spatial information, motion information, and the relationships between the two across uneven time spans. We propose a network that extracts the semantic information of video sequences through deep fusion of local spatial–temporal features. Convolutional neural networks (CNNs) are used to extract local spatial information and local motion information, respectively; the spatial features are then fused with the motion features of the corresponding time step by a three-dimensional convolution, yielding the local spatial–temporal information of that moment. These local spatial–temporal features are fed into a long short-term memory (LSTM) network to model their contextual relationships along the long time dimension. We further add a regional attention mechanism over video frames to this context-modeling stage: the spatial features of the last convolutional layer and the features of the first fully connected layer are fed into separate LSTM networks, and the outputs of the two LSTMs are merged at each time step. This lets the fully connected layer, which is rich in categorical information, provide a frame attention mechanism for the spatial feature layer. Experiments on three common action recognition datasets, UCF101, UCF11, and UCF Sports, show that the proposed spatial–temporal information deep fusion network achieves high recognition accuracy on the action recognition task.
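The abstract only outlines the architecture; the following is a minimal PyTorch sketch of the described pipeline under stated assumptions. The CNN backbones, feature dimensions, the depth-2 stacking used for the 3-D fusion, and the last-step classification readout are all hypothetical stand-ins, not details taken from the paper.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All layer sizes, names, and fusion details are assumptions.
import torch
import torch.nn as nn


class SpatioTemporalFusionNet(nn.Module):
    """Sketch: per-frame CNN features for appearance and motion are
    stacked and fused by a 3-D convolution, then two parallel LSTMs
    (one on conv-layer features, one on fc-layer features) are merged
    at each time step, approximating the frame attention described."""

    def __init__(self, feat_dim=256, hidden=128, num_classes=101):
        super().__init__()
        # Assumed lightweight 2-D CNNs standing in for the spatial and
        # motion (e.g., optical-flow) feature extractors.
        self.spatial_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.motion_cnn = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # 3-D convolution fusing the stacked spatial/motion feature maps
        # of the same time step into one local spatio-temporal feature.
        self.fusion = nn.Conv3d(64, 64, kernel_size=(2, 3, 3),
                                padding=(0, 1, 1))
        self.fc = nn.Linear(64 * 4 * 4, feat_dim)
        # Two LSTMs: one over the fused conv-layer features, one over
        # the fc-layer features; their per-step outputs are merged.
        self.lstm_conv = nn.LSTM(64 * 4 * 4, hidden, batch_first=True)
        self.lstm_fc = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, flows):
        # frames: (B, T, 3, H, W) RGB; flows: (B, T, 2, H, W) flow.
        B, T = frames.shape[:2]
        s = self.spatial_cnn(frames.flatten(0, 1))   # (B*T, 64, 4, 4)
        m = self.motion_cnn(flows.flatten(0, 1))     # (B*T, 64, 4, 4)
        # Stack along a depth-2 axis and fuse with the 3-D convolution.
        pair = torch.stack([s, m], dim=2)            # (B*T, 64, 2, 4, 4)
        fused = self.fusion(pair).squeeze(2)         # (B*T, 64, 4, 4)
        conv_feat = fused.flatten(1).view(B, T, -1)  # (B, T, 1024)
        fc_feat = self.fc(conv_feat)                 # (B, T, feat_dim)
        out_c, _ = self.lstm_conv(conv_feat)
        out_f, _ = self.lstm_fc(fc_feat)
        merged = torch.cat([out_c, out_f], dim=-1)   # per-step merge
        return self.classifier(merged[:, -1])        # class logits


# Usage with random tensors: 8-frame clips at 64x64 resolution.
net = SpatioTemporalFusionNet()
logits = net(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 2, 64, 64))
print(logits.shape)  # torch.Size([2, 101])
```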
Ou, H., & Sun, J. (2019). Spatiotemporal information deep fusion network with frame attention mechanism for video action recognition. Journal of Electronic Imaging, 28(2), 023009. https://doi.org/10.1117/1.jei.28.2.023009