The performance of predicting human fixations in videos has been greatly improved by the development of convolutional neural networks (CNNs). In this paper, we propose SalSAC, a novel end-to-end neural network for video saliency prediction that uses CNN-LSTM-Attention as its basic architecture and exploits information from both the static and the dynamic aspects of a video. To better represent the static information of each frame, we first extract multi-level features of the same size from different layers of the encoder CNN and compute the corresponding multi-level attention maps; we then randomly shuffle these attention maps among levels and multiply each one with the extracted multi-level features, respectively. In this way, we leverage the attention consistency across different layers to improve the robustness of the network. On the dynamic aspect, we propose a correlation-based ConvLSTM to appropriately balance the influence of the current frame and the preceding frames on the prediction. Experimental results on the DHF1K, Hollywood2, and UCF-sports datasets show that SalSAC outperforms many existing state-of-the-art methods.
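The two mechanisms summarized above can be made concrete with short sketches. First, a minimal PyTorch sketch of the shuffled-attention idea: one attention map is computed per feature level, and the maps are randomly permuted among levels before re-weighting the features. The attention head used here (a 1x1 convolution followed by a sigmoid), the shuffle-only-during-training behavior, and all tensor shapes are assumptions for illustration, not the paper's exact design.

```python
import random
import torch
import torch.nn as nn

class ShuffledMultiLevelAttention(nn.Module):
    """Computes one spatial attention map per feature level, then randomly
    shuffles the maps among levels before re-weighting the features."""

    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        # Hypothetical attention head per level: 1x1 conv -> sigmoid.
        self.attn_heads = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
            for _ in range(num_levels)
        )

    def forward(self, features):
        # features: list of num_levels tensors, each (B, C, H, W), same size.
        attn_maps = [head(f) for head, f in zip(self.attn_heads, features)]
        if self.training:
            # Swap attention maps among levels. Because attention is assumed
            # to be consistent across layers, the swap acts as regularization
            # rather than corrupting the features.
            random.shuffle(attn_maps)
        # Re-weight each level's features with its (possibly swapped) map.
        return [f * a for f, a in zip(features, attn_maps)]
```

Second, a rough sketch of how a correlation score could balance the current frame against the accumulated history in the ConvLSTM. The cosine-similarity gate and the way it scales the previous hidden state are assumptions about the "correlation-based" mechanism; `conv_lstm_cell` is a hypothetical standard ConvLSTM cell, not an API from the paper.

```python
import torch
import torch.nn.functional as F

def correlation_gate(x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
    """Per-sample cosine similarity between the current frame features and
    the previous hidden state, mapped to [0, 1]."""
    b = x_t.size(0)
    sim = F.cosine_similarity(x_t.reshape(b, -1), h_prev.reshape(b, -1), dim=1)
    return ((sim + 1.0) / 2.0).view(b, 1, 1, 1)  # broadcastable over (C, H, W)

# Inside the recurrence, the gate would blend in the previous hidden state:
#   w = correlation_gate(x_t, h_prev)
#   h_in = w * h_prev  # trust history less when consecutive frames decorrelate
#   h_t, c_t = conv_lstm_cell(x_t, (h_in, c_prev))
```

Under this reading, a scene cut or rapid motion drives the correlation toward zero, so the prediction leans on the current frame, while for near-static shots the history dominates.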
Citation:
Wu, X., Wu, Z., Zhang, J., Ju, L., & Wang, S. (2020). SalSAC: A video saliency prediction model with shuffled attentions and correlation-based ConvLSTM. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (pp. 12410–12417). AAAI Press. https://doi.org/10.1609/aaai.v34i07.6927