Shorter-is-better: Venue category estimation from micro-video

Abstract

According to our statistics on over 2 million micro-videos, only 1.22% of them are associated with venue information, which greatly hinders location-oriented applications and personalized services. To alleviate this problem, we aim to label these bite-sized video clips with venue categories. This is, however, nontrivial for three reasons: 1) no benchmark dataset is available; 2) micro-videos suffer from insufficient information, low quality, and information loss; and 3) the relatedness among venue categories is complex. To this end, we propose a scheme comprising two components. In the first component, we crawl a representative set of micro-videos from Vine and extract a rich set of features from the textual, visual, and acoustic modalities. In the second component, we build a tree-guided multi-task multi-modal learning model to estimate the venue category of each unseen micro-video. This model jointly learns a common space from the multiple modalities and leverages the predefined Foursquare hierarchical structure to regularize the relatedness among venue categories. Extensive experiments validate our model. As a side research contribution, we have released our data, code, and the parameters involved.
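
The idea of using a category hierarchy to regularize related tasks can be illustrated with a generic tree-structured group penalty. The sketch below is only an illustration under stated assumptions: the abstract does not give the model's objective, so the function name `tree_group_penalty`, the weight-matrix layout, the toy hierarchy, and the uniform node weights are all hypothetical and are not taken from the released code.

```python
import numpy as np


def tree_group_penalty(W, groups, weights=None):
    """Generic tree-structured group penalty (illustrative, not the paper's exact objective).

    W       : array of shape (n_features, n_categories), one column per venue category.
    groups  : list of column-index lists, one per tree node (each node covers
              the leaf categories beneath it).
    weights : optional per-node weights; assumed uniform when omitted.
    """
    if weights is None:
        weights = [1.0] * len(groups)
    total = 0.0
    for node_weight, cols in zip(weights, groups):
        block = W[:, cols]
        # l2 norm of each feature row across the node's categories, summed over
        # features: categories under the same node are encouraged to share features.
        total += node_weight * np.linalg.norm(block, axis=1).sum()
    return total


# Hypothetical toy hierarchy: root -> {parent of categories 0 and 1, category 2}.
groups = [[0, 1, 2], [0, 1], [0], [1], [2]]
W = np.array([[3.0, 4.0, 0.0],
              [0.0, 0.0, 2.0]])
penalty = tree_group_penalty(W, groups)
```

In a full model, a term of this kind would be added to the multi-task loss so that sibling categories in the hierarchy tend to select the same features.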

Zhang, J., Nie, L., Wang, X., He, X., Huang, X., & Chua, T. S. (2016). Shorter-is-better: Venue category estimation from micro-video. In MM 2016 - Proceedings of the 2016 ACM Multimedia Conference (pp. 1415–1424). Association for Computing Machinery, Inc. https://doi.org/10.1145/2964284.2964307
