Learning coarse and fine features for precise temporal action localization


Abstract

Temporal action localization in untrimmed videos is a fundamental task for real-world computer vision applications such as video surveillance systems. Even though a great deal of research attention has been paid to the problem, precise localization of human activities at the frame level still remains a challenge. In this paper, we propose CoarseFine networks that learn highly discriminative features without loss of time granularity using two streams: the coarse and fine networks. The coarse network aims to classify the action category based on the global context of a video, taking advantage of the descriptive power of successful action recognition models. The fine network, in contrast, deploys no temporal pooling and is constrained to a low channel capacity; it is specialized to identify the per-frame location of actions based on local semantics. This approach enables CoarseFine networks to learn fine-grained representations without any temporal information loss. Our extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet-v1.3, validate that the proposed method outperforms the state of the art by a remarkable margin on per-frame labeling and temporal action localization tasks while significantly reducing computational cost.
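To make the two-stream idea concrete, below is a minimal PyTorch sketch of the design the abstract describes. It is not the authors' implementation: the module names, channel widths, backbone depth, and the late-fusion scheme are all illustrative assumptions. Only the structural contrast is taken from the abstract: the coarse stream pools over time for clip-level global context, while the fine stream applies no temporal pooling, so it keeps one low-capacity feature per input frame.

    # Illustrative sketch only, not the authors' code. Channel widths,
    # depths, and the fusion head are assumptions for demonstration.
    import torch
    import torch.nn as nn


    class CoarseStream(nn.Module):
        """High-capacity stream: temporal pooling trades time granularity
        for discriminative, clip-level (global) features."""

        def __init__(self, in_ch=3, width=256):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                # Temporal stride 2 halves time resolution, as in typical
                # action-recognition backbones.
                nn.Conv3d(width, width, kernel_size=3, stride=(2, 2, 2), padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):  # x: (B, C, T, H, W)
            f = self.features(x)
            # Global context vector: average over time and space.
            return f.mean(dim=(2, 3, 4))  # (B, width)


    class FineStream(nn.Module):
        """Low-capacity stream: no temporal pooling, so every input frame
        keeps its own feature for per-frame localization."""

        def __init__(self, in_ch=3, width=32):
            super().__init__()
            self.features = nn.Sequential(
                # Temporal stride is always 1; only spatial size shrinks.
                nn.Conv3d(in_ch, width, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(width, width, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):  # x: (B, C, T, H, W)
            f = self.features(x)       # (B, width, T, H', W')
            return f.mean(dim=(3, 4))  # (B, width, T): one feature per frame


    class CoarseFineSketch(nn.Module):
        """Fuses the global context into every frame feature, then
        predicts per-frame class scores (a hypothetical fusion head)."""

        def __init__(self, num_classes=20, coarse_w=256, fine_w=32):
            super().__init__()
            self.coarse = CoarseStream(width=coarse_w)
            self.fine = FineStream(width=fine_w)
            self.frame_head = nn.Conv1d(coarse_w + fine_w, num_classes, kernel_size=1)

        def forward(self, x):
            ctx = self.coarse(x)        # (B, coarse_w) global context
            per_frame = self.fine(x)    # (B, fine_w, T) per-frame features
            ctx = ctx.unsqueeze(-1).expand(-1, -1, per_frame.shape[-1])
            fused = torch.cat([ctx, per_frame], dim=1)
            return self.frame_head(fused)  # (B, num_classes, T)


    if __name__ == "__main__":
        clip = torch.randn(2, 3, 16, 112, 112)  # 16-frame RGB clips
        print(CoarseFineSketch()(clip).shape)   # torch.Size([2, 20, 16])

Running the script prints a (batch, classes, frames) tensor, i.e. one class score per input frame, which is the per-frame labeling output the abstract refers to; the abstract's actual backbone and fusion details differ.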

Cite

APA

Kim, J. H., & Heo, J. P. (2019). Learning coarse and fine features for precise temporal action localization. IEEE Access, 7, 149797–149809. https://doi.org/10.1109/ACCESS.2019.2946898
