NeXtVLAD: An efficient neural network to aggregate frame-level features for large-scale video classification

27 citations · 183 Mendeley readers


Abstract

This paper introduces a fast and efficient network architecture, NeXtVLAD, to aggregate frame-level features into a compact feature vector for large-scale video classification. The basic idea is to decompose a high-dimensional feature into a group of relatively low-dimensional vectors with attention before applying NetVLAD aggregation over time. This NeXtVLAD approach turns out to be both effective and parameter-efficient in aggregating temporal information. In the 2nd Youtube-8M video understanding challenge, a single NeXtVLAD model with fewer than 80M parameters achieves a GAP score of 0.87846 on the private leaderboard. A mixture of 3 NeXtVLAD models reaches 0.88722, ranked 3rd among 394 teams. The code is publicly available at https://github.com/linrongc/youtube-8m.
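The idea stated in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal, illustrative reconstruction based only on that description (expand the frame feature, split it into groups, weight each group with an attention score, then apply NetVLAD-style residual aggregation over time). The class name, hyperparameter names, and default values are assumptions for illustration, not the authors' implementation; the reference TensorFlow code is in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeXtVLADSketch(nn.Module):
    """Minimal sketch of a NeXtVLAD-style aggregation layer.

    Hyperparameters (clusters, expansion, groups) are illustrative
    assumptions; consult the authors' repository for the real layer.
    """

    def __init__(self, feature_dim, clusters=64, expansion=2, groups=8):
        super().__init__()
        self.groups = groups
        self.clusters = clusters
        self.group_dim = expansion * feature_dim // groups  # reduced per-group dim

        # 1) expand the frame feature before splitting it into groups
        self.expand = nn.Linear(feature_dim, expansion * feature_dim)
        # 2) per-frame, per-group attention scores
        self.attention = nn.Linear(expansion * feature_dim, groups)
        # 3) soft assignment of each group vector to the clusters
        self.assign = nn.Linear(expansion * feature_dim, groups * clusters)
        # learnable cluster centers in the reduced group dimension
        self.centers = nn.Parameter(torch.randn(clusters, self.group_dim) * 0.01)

    def forward(self, x):                      # x: (batch, frames, feature_dim)
        b, t, _ = x.shape
        x = self.expand(x)                     # (b, t, expansion * feature_dim)

        attn = torch.sigmoid(self.attention(x))           # (b, t, groups)
        assign = self.assign(x).view(b, t, self.groups, self.clusters)
        assign = F.softmax(assign, dim=-1)                 # soft cluster assignment
        assign = assign * attn.unsqueeze(-1)               # modulate by group attention

        x = x.view(b, t, self.groups, self.group_dim)      # low-dimensional group vectors

        # NetVLAD-style residual aggregation over frames and groups
        a_sum = assign.sum(dim=(1, 2))                     # assignment mass per cluster
        vlad = torch.einsum('btgk,btgd->bkd', assign, x)   # weighted sum per cluster
        vlad = vlad - a_sum.unsqueeze(-1) * self.centers   # subtract weighted centers

        vlad = F.normalize(vlad, dim=-1)                   # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)        # (b, clusters * group_dim)
```

As a rough usage example, a batch of 300-frame clips with 1152-dimensional features (e.g., 1024-d visual plus 128-d audio, as in Youtube-8M) with groups=8 and expansion=2 yields a per-group dimension of 288 and a clusters x 288 descriptor per video before any downstream classifier; these numbers are only meant to show how the group decomposition keeps the aggregated vector compact.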

Citation (APA)

Lin, R., Xiao, J., & Fan, J. (2019). NeXtVLAD: An efficient neural network to aggregate frame-level features for large-scale video classification. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11132 LNCS, pp. 206–218). Springer Verlag. https://doi.org/10.1007/978-3-030-11018-5_19
