Non-local NetVLAD encoding for video classification

5Citations
Citations of this article
38Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

This paper describes our solution for the 2 nd YouTube-8M video understanding challenge organized by Google AI. Unlike the video recognition benchmarks, such as Kinetics and Moments, the YouTube-8M challenge provides pre-extracted visual and audio features instead of raw videos. In this challenge, the submitted model is restricted to 1 GB, which encourages participants focus on constructing one powerful single model rather than incorporating of the results from a bunch of models. Our system fuses six different sub-models into one single computational graph, which are categorized into three families. More specifically, the most effective family is the model with non-local operations following the NetVLAD encoding. The other two family models are Soft-BoF and GRU, respectively. In order to further boost single models performance, the model parameters of different checkpoints are averaged. Experimental results demonstrate that our proposed system can effectively perform the video classification task, achieving 0.88763 on the public test set and 0.88704 on the private set in terms of GAP@20, respectively. We finally ranked at the fourth place in the YouTube-8M video understanding challenge.

Cite

CITATION STYLE

APA

Tang, Y., Zhang, X., Wang, J., Chen, S., Ma, L., & Jiang, Y. G. (2019). Non-local NetVLAD encoding for video classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11132 LNCS, pp. 219–228). Springer Verlag. https://doi.org/10.1007/978-3-030-11018-5_20

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free