This paper considers the task of action detection in long untrimmed videos. Existing methods tend to process every frame or fragment throughout the whole video to make detection decisions, which is not only time-consuming but also computationally burdensome. Instead, we present an attention-based model that performs action detection by watching only a few fragments, so its cost is independent of video length and it can consequently be applied to real-world videos. Our motivation is inspired by the observation that humans usually focus their attention sequentially on different frames of a video to quickly narrow down the extent in which an action occurs. Our model is a two-phase architecture. In the first phase, a temporal proposal network predicts temporal proposals for multi-category actions: it observes a fixed number of locations in a video to predict action bounds and learns a location transfer policy. In the second phase, a well-trained classifier extracts visual information from the proposals, classifies the action, and decides whether to adopt each proposal. We evaluate our model on the ActivityNet dataset and show that it significantly outperforms the baseline.
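The two-phase flow described above can be sketched in plain Python. This is a toy illustration only, not the authors' model: the "location transfer policy" here is a hand-written rule (move toward the half with higher per-frame actionness), whereas the paper learns it, and all function names and scores are hypothetical placeholders.

```python
def propose(frame_scores, num_glimpses=3):
    """Phase 1 (sketch): glimpse a fixed number of locations to narrow
    the temporal bounds (start, end) of a candidate action."""
    n = len(frame_scores)
    start, end = 0, n
    loc = n // 2  # begin glimpsing at the video midpoint
    for _ in range(num_glimpses):
        # toy "location transfer policy": shrink toward the half
        # whose summed actionness score is higher
        left = sum(frame_scores[start:loc])
        right = sum(frame_scores[loc:end])
        if left > right:
            end = loc
        else:
            start = loc
        loc = (start + end) // 2
    return start, end

def classify(frame_scores, start, end, threshold=0.5):
    """Phase 2 (sketch): score the proposed segment and decide
    whether to adopt the proposal."""
    segment = frame_scores[start:end]
    conf = sum(segment) / max(len(segment), 1)
    return ("action", conf) if conf >= threshold else (None, conf)

# toy per-frame "actionness" scores for a 16-frame video;
# the action roughly spans frames 4-9
scores = [0.1, 0.1, 0.2, 0.1, 0.8, 0.9, 0.9, 0.8,
          0.7, 0.8, 0.2, 0.1, 0.1, 0.2, 0.1, 0.1]
s, e = propose(scores)
label, conf = classify(scores, s, e)
```

Note that only `num_glimpses` locations are inspected regardless of video length, which is the property the abstract emphasizes.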
Chen, X., Wang, W., Li, W., & Wang, J. (2017). Attention-based two-phase model for video action detection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10425 LNCS, pp. 81–93). Springer Verlag. https://doi.org/10.1007/978-3-319-64698-5_8