Temporal action localization in untrimmed videos is a fundamental task for real-world computer vision applications such as video surveillance. Although the problem has received a great deal of research attention, precise localization of human activities at the frame level remains a challenge. In this paper, we propose CoarseFine networks, which learn highly discriminative features without loss of temporal granularity through two streams: the coarse network and the fine network. The coarse network classifies the action category from the global context of a video, taking advantage of the descriptive power of successful action recognition models. The fine network, in contrast, does not deploy temporal pooling and is constrained to a low channel capacity; it is specialized to identify the per-frame location of actions based on local semantics. This design allows CoarseFine networks to learn fine-grained representations without any temporal information loss. Extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet-v1.3, validate that the proposed method outperforms the state-of-the-art by a remarkable margin on per-frame labeling and temporal action localization while significantly reducing computational cost.
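As a rough illustration of the two-stream idea summarized in the abstract, the sketch below contrasts a coarse stream that applies global temporal pooling for video-level classification with a fine stream that keeps full temporal resolution at a narrow channel width for per-frame scoring. This is a minimal, hypothetical PyTorch-style sketch, not the authors' implementation: the module names, layer configurations, channel widths, and the absence of any fusion between the streams are assumptions made only for illustration.

```python
# Minimal sketch of a coarse/fine two-stream design (illustrative only, not the paper's code).
# Assumptions: a small 3D-CNN backbone, arbitrary channel widths, no stream fusion shown.
import torch
import torch.nn as nn


class CoarseStream(nn.Module):
    """Pools over time and space to predict a video-level action category."""
    def __init__(self, in_ch=3, feat_ch=256, num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_ch, feat_ch, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)            # global temporal + spatial pooling
        self.fc = nn.Linear(feat_ch, num_classes)

    def forward(self, x):                              # x: (B, C, T, H, W)
        f = self.pool(self.backbone(x)).flatten(1)     # (B, feat_ch)
        return self.fc(f)                              # video-level class scores


class FineStream(nn.Module):
    """Keeps full temporal resolution (no temporal pooling) with a low channel capacity."""
    def __init__(self, in_ch=3, feat_ch=64, num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_ch, feat_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        self.spatial_pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # pool space, keep time
        self.classifier = nn.Conv1d(feat_ch, num_classes, kernel_size=1)

    def forward(self, x):                              # x: (B, C, T, H, W)
        f = self.spatial_pool(self.backbone(x))        # (B, feat_ch, T, 1, 1)
        f = f.squeeze(-1).squeeze(-1)                  # (B, feat_ch, T)
        return self.classifier(f)                      # per-frame class scores (B, num_classes, T)


if __name__ == "__main__":
    clip = torch.randn(2, 3, 32, 112, 112)             # batch of 2 clips, 32 frames each
    print(CoarseStream()(clip).shape)                   # torch.Size([2, 20])
    print(FineStream()(clip).shape)                     # torch.Size([2, 20, 32])
```

The key contrast the sketch is meant to show is that the coarse stream uses temporal strides and global pooling (cheap, video-level context), while the fine stream uses unit temporal stride so every input frame keeps its own prediction; how the two outputs are combined is not specified in the abstract and is left out here.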
Kim, J. H., & Heo, J. P. (2019). Learning coarse and fine features for precise temporal action localization. IEEE Access, 7, 149797–149809. https://doi.org/10.1109/ACCESS.2019.2946898