Self-Supervised Video Action Localization with Adversarial Temporal Transforms

Abstract

Weakly-supervised temporal action localization aims to locate the intervals of action instances using only video-level action labels for training. However, the localization results generated from video classification networks are often inaccurate because temporal boundary annotations of actions are unavailable. Our motivating insight is that the temporal boundaries of an action should be predicted stably under various temporal transforms. This inspires a self-supervised equivariant transform consistency constraint. We design a set of temporal transform operations, ranging from naive temporal down-sampling to learnable attention-piloted time warping. In our model, a localization network aims to perform well under all transforms, while a policy network chooses, at each iteration, a temporal transform that adversarially makes the resulting localization inconsistent with the localization network's prediction. Additionally, we devise a self-refine module that enhances the completeness of action intervals by harnessing temporal and semantic contexts. Experimental results on THUMOS14 and ActivityNet demonstrate that our model consistently outperforms state-of-the-art weakly-supervised temporal action localization methods.
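
The equivariant consistency idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the simplest transform mentioned (temporal down-sampling) and uses a hypothetical `toy_localizer` (a linear scorer followed by temporal smoothing) in place of a real localization network. All function names and parameters are illustrative. The loss compares "transform, then localize" against "localize, then transform".

```python
import numpy as np

def temporal_downsample(seq, stride):
    """Keep every `stride`-th time step along the first (time) axis."""
    return seq[::stride]

def toy_localizer(features, weights):
    """Hypothetical stand-in for a localization network (an assumption).

    Scores each snippet linearly, smooths the logits over a 3-step
    temporal window, and maps them to (0, 1) with a sigmoid.
    """
    logits = features @ weights                      # shape: (T,)
    kernel = np.ones(3) / 3.0                        # moving-average context
    smoothed = np.convolve(logits, kernel, mode="same")
    return 1.0 / (1.0 + np.exp(-smoothed))

def equivariance_consistency_loss(features, weights, stride):
    """Mean squared gap between the two orders of applying the transform."""
    # Path A: transform the input, then localize.
    scores_a = toy_localizer(temporal_downsample(features, stride), weights)
    # Path B: localize the original input, then transform the scores.
    scores_b = temporal_downsample(toy_localizer(features, weights), stride)
    return float(np.mean((scores_a - scores_b) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 4))   # 16 snippets, 4-dim features (made up)
    w = rng.normal(size=4)
    for s in (1, 2, 4):
        print(s, equivariance_consistency_loss(feats, w, s))
```

With `stride=1` the two paths are identical and the loss is zero. For larger strides the smoothing step sees different neighbours in each path, so the loss is generally positive. In a training setting, a term of this kind would be minimized so that predictions stay stable under the transform. The adversarial policy network and the learnable time warping described above are outside the scope of this sketch.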

Cite

Gong, G., Zheng, L., Jiang, W., & Mu, Y. (2021). Self-Supervised Video Action Localization with Adversarial Temporal Transforms. In IJCAI International Joint Conference on Artificial Intelligence (pp. 693–699). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2021/96
