SalSAC: A video saliency prediction model with shuffled attentions and correlation-based ConvLSTM


Abstract

The performance of predicting human fixations in videos has improved considerably with the development of convolutional neural networks (CNNs). In this paper, we propose a novel end-to-end neural network, “SalSAC”, for video saliency prediction, which uses CNN-LSTM-Attention as its basic architecture and exploits information from both the static and the dynamic aspects of a video. To better represent the static information of each frame, we first extract multi-level features of the same size from different layers of the encoder CNN and compute the corresponding multi-level attention maps; we then randomly shuffle these attention maps among levels and multiply each one element-wise with the extracted multi-level features. In this way, we leverage attention consistency across different layers to improve the robustness of the network. On the dynamic aspect, we propose a correlation-based ConvLSTM to appropriately balance the influence of the current and preceding frames on the prediction. Experimental results on the DHF1K, Hollywood2 and UCF-sports datasets show that SalSAC outperforms many existing state-of-the-art methods.
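The abstract only sketches these two mechanisms, so the following PyTorch snippet is a minimal illustration of them as described, not the authors' implementation. The class names, the 1x1-conv-plus-sigmoid attention heads, and the cosine-similarity gating of the hidden state are all assumptions made for this sketch.

```python
import torch
import torch.nn as nn


class ShuffledAttention(nn.Module):
    """Sketch of the shuffled-attention idea: compute one spatial
    attention map per feature level, randomly permute the maps across
    levels during training, and re-weight the features with them."""

    def __init__(self, channels, num_levels):
        super().__init__()
        # One 1x1-conv + sigmoid attention head per level (an assumption;
        # the abstract does not specify the head architecture).
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
            for _ in range(num_levels)
        ])

    def forward(self, feats):
        # feats: list of same-size multi-level tensors, each (B, C, H, W).
        attns = [head(f) for head, f in zip(self.heads, feats)]
        if self.training:
            # Shuffle attention maps among levels; the regularizing effect
            # relies on attention consistency across layers.
            perm = torch.randperm(len(attns)).tolist()
            attns = [attns[i] for i in perm]
        # Element-wise multiplication of each level by its (shuffled) map.
        return [f * a for f, a in zip(feats, attns)]


class CorrelationGate(nn.Module):
    """Sketch of the correlation-based weighting: scale the previous
    hidden state by its per-location similarity to the current frame's
    features, so that history uncorrelated with the current frame
    (e.g. after a shot cut) contributes less to the prediction."""

    def forward(self, x, h_prev):
        # Channel-wise cosine similarity at each spatial location.
        corr = torch.cosine_similarity(x, h_prev, dim=1)    # (B, H, W)
        weight = torch.sigmoid(corr).unsqueeze(1)           # (B, 1, H, W)
        return h_prev * weight
```

A standard ConvLSTM cell would then consume the current features together with the gated hidden state; exactly where the correlation enters the recurrence is a design choice the abstract leaves open.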

Citation (APA)

Wu, X., Wu, Z., Zhang, J., Ju, L., & Wang, S. (2020). SalSAC: A video saliency prediction model with shuffled attentions and correlation-based ConvLSTM. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (pp. 12410–12417). AAAI Press. https://doi.org/10.1609/aaai.v34i07.6927
