Pyramid constrained self-attention network for fast video salient object detection


Abstract

Spatiotemporal information is essential for video salient object detection (VSOD) because object motion strongly attracts human attention. Previous VSOD methods usually use Long Short-Term Memory (LSTM) or 3D ConvNets (C3D), which can only encode motion information through step-by-step propagation in the temporal domain. Recently, the non-local mechanism was proposed to capture long-range dependencies directly. However, it is not straightforward to apply the non-local mechanism to VSOD, because i) it fails to capture motion cues and tends to learn motion-independent global contexts; ii) its computation and memory costs are prohibitive for video dense prediction tasks such as VSOD. To address these problems, we design a Constrained Self-Attention (CSA) operation to capture motion cues, based on the prior that objects always move along continuous trajectories. We group a set of CSA operations in pyramid structures (PCSA) to capture objects at various scales and speeds. Extensive experimental results demonstrate that our method outperforms previous state-of-the-art methods in both accuracy and speed (110 FPS on a single Titan Xp) on five challenging datasets. Our code is available at https://github.com/guyuchao/PyramidCSA.
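The core idea of constrained self-attention — restricting each query position to attend only within a local spatiotemporal window, following the prior that objects move along continuous trajectories — can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the window size, the dot-product scoring, and the use of the same features as queries, keys, and values are all simplifying assumptions made here for clarity.

```python
import numpy as np

def constrained_self_attention(feats, window=1):
    """Toy constrained self-attention over a video feature volume.

    feats:  array of shape (T, H, W, C) — features for T frames.
    window: each position attends only to a (2*window+1)^2 spatial
            neighborhood around itself, across all T frames, rather
            than to every position (as plain non-local attention would).
    """
    T, H, W, C = feats.shape
    out = np.zeros_like(feats)
    for t in range(T):
        for y in range(H):
            for x in range(W):
                q = feats[t, y, x]                      # query vector, (C,)
                # Constrained key/value set: a local spatial window,
                # gathered from every frame (the motion trajectory prior).
                y0, y1 = max(0, y - window), min(H, y + window + 1)
                x0, x1 = max(0, x - window), min(W, x + window + 1)
                keys = feats[:, y0:y1, x0:x1, :].reshape(-1, C)
                # Scaled dot-product attention with a stable softmax.
                scores = keys @ q / np.sqrt(C)
                w = np.exp(scores - scores.max())
                w /= w.sum()
                out[t, y, x] = w @ keys                 # weighted value sum
    return out
```

A pyramid of such operations (the PCSA of the paper) would apply this at several window sizes and feature resolutions, so that objects of different scales and speeds fall inside at least one attention window.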

Cite

APA

Gu, Y., Wang, L., Wang, Z., Liu, Y., Cheng, M. M., & Lu, S. P. (2020). Pyramid constrained self-attention network for fast video salient object detection. In AAAI 2020 - 34th AAAI Conference on Artificial Intelligence (pp. 10869–10876). AAAI press. https://doi.org/10.1609/aaai.v34i07.6718
