Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization

Haoming Xu; Runhao Zeng; Qingyao Wu; Mingkui Tan; Chuang Gan

Conference Proceedings (Open Access)

MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (2020) 3893-3901

DOI: 10.1145/3394171.3413581

Abstract

We address the challenging task of event localization, which requires a machine to localize an event and recognize its category in unconstrained videos. Most existing methods leverage only the visual information of a video and neglect its audio information, which can be very helpful for event localization. For example, humans often recognize an event by reasoning over visual and audio content simultaneously. Moreover, audio information can guide the model to pay more attention to the informative regions of visual scenes, which helps reduce interference from the background. Motivated by these observations, we propose a relation-aware network that leverages both audio and visual information for accurate event localization. Specifically, to reduce background interference, we propose an audio-guided spatial-channel attention module that guides the model to focus on event-relevant visual regions. In addition, we build connections between the visual and audio modalities with a relation-aware module. In particular, we learn the representations of video and/or audio segments by aggregating information from the other modality according to the cross-modal relations. Finally, relying on the relation-aware representations, we conduct event localization by predicting the event-relevant score and the classification score. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art methods in both the supervised and weakly-supervised audio-visual event (AVE) localization settings.
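The two ideas named in the abstract, audio-guided attention over visual regions and aggregation of information from the other modality according to cross-modal relations, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the function names, the tensor shapes, the dot-product scoring, the residual connection, and the absence of learned projections and channel attention are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def audio_guided_spatial_attention(visual_regions, audio):
    """Pool visual regions with weights conditioned on the audio feature.

    visual_regions: (T, R, D) features for R spatial regions in each of T segments.
    audio:          (T, D) one audio feature per segment.
    Returns (T, D). Hypothetical simplification: spatial attention only,
    no channel attention and no learned parameters.
    """
    d = visual_regions.shape[-1]
    # Similarity between each region and the audio feature of the same segment.
    scores = np.einsum("trd,td->tr", visual_regions, audio) / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # (T, R), each row sums to 1
    return np.einsum("tr,trd->td", weights, visual_regions)


def cross_modal_aggregate(query, other):
    """Update segments of one modality with information from the other.

    query, other: (T, D) segment features of the two modalities.
    The (T, T) relation matrix is a softmax over scaled dot products;
    the residual addition is an assumption of this sketch.
    """
    d = query.shape[-1]
    relations = softmax(query @ other.T / np.sqrt(d), axis=-1)
    return query + relations @ other


# Arbitrary illustrative sizes: 10 segments, 49 regions, 16 feature channels.
rng = np.random.default_rng(0)
T, R, D = 10, 49, 16
visual_regions = rng.standard_normal((T, R, D))
audio = rng.standard_normal((T, D))

visual = audio_guided_spatial_attention(visual_regions, audio)
visual_rel = cross_modal_aggregate(visual, audio)  # visual attends to audio
audio_rel = cross_modal_aggregate(audio, visual)   # audio attends to visual
```

In a full model, the relation-aware features (`visual_rel`, `audio_rel` here) would feed heads that predict the event-relevant score and the classification score; those heads are omitted from this sketch.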

Cite

Xu, H., Zeng, R., Wu, Q., Tan, M., & Gan, C. (2020). Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 3893–3901). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3413581
