Spatio-Temporal Transformer for Online Video Understanding


Abstract

Leading methods for online video understanding extract useful information from the spatial and temporal dimensions of an input video, but they suffer from two problems: (1) they can only extract local video information and cannot relate it to important features of the temporal context of the video; (2) although some methods can quickly process the information in each individual frame, their efficiency over the whole video is poor, so they cannot be applied to online video understanding tasks. This article introduces a Transformer-based network that considers both spatial and temporal content and processes each video quickly. Our approach can efficiently handle up to 170 videos with hundreds of frames per second for action classification, running 10 to 90 times faster than existing methods on action classification datasets.
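The abstract does not spell out the paper's exact attention design, but a common way for video Transformers to consider spatial and temporal content jointly, while staying cheaper than full space-time attention, is to factorize self-attention into a temporal pass (each spatial patch attends across frames) followed by a spatial pass (patches within a frame attend to one another). The sketch below is a generic, illustrative version of that idea in NumPy, not the authors' implementation; it omits learned projections, multiple heads, and residual connections for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # single-head self-attention without learned Q/K/V projections
    # (illustrative only; real models project x before attending)
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores) @ x                    # (..., L, d)

def factorized_spacetime_attention(tokens):
    # tokens: (T frames, N spatial patches, d channels)
    # temporal pass: each spatial patch attends across the T frames
    t_in = tokens.transpose(1, 0, 2)               # (N, T, d)
    t_out = attention(t_in).transpose(1, 0, 2)     # back to (T, N, d)
    # spatial pass: within each frame, patches attend to one another
    return attention(t_out)                        # (T, N, d)

x = np.random.default_rng(0).normal(size=(8, 16, 32))
y = factorized_spacetime_attention(x)
print(y.shape)  # (8, 16, 32)
```

Factorizing the two passes reduces the attention cost from O((T·N)^2) for full space-time attention to O(T^2·N + N^2·T), which is what makes per-frame processing fast enough for online settings.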

Citation (APA)

Du, Z., Zhang, G., Lu, W., Zhao, T., & Wu, P. (2022). Spatio-Temporal Transformer for Online Video Understanding. In Journal of Physics: Conference Series (Vol. 2171). IOP Publishing Ltd. https://doi.org/10.1088/1742-6596/2171/1/012020
