Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Fa Ting Hong; Jia Chang Feng; Dan Xu; Ying Shan; Wei Shi Zheng

Conference Proceedings, Open Access

MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (2021) 1591-1599

DOI: 10.1145/3474085.3475298

Abstract

Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., through feature concatenation or score-level fusion. In this work, we argue that features extracted from pre-trained extractors, e.g., I3D, which are trained for trimmed-video action classification rather than specifically for the WS-TAL task, lead to inevitable redundancy and sub-optimization. Therefore, feature re-calibration is needed to reduce task-irrelevant information redundancy. Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we introduce two identical cross-modal consensus modules (CCM), each of which uses a cross-modal attention mechanism to filter out task-irrelevant information redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further explore inter-modality consistency: we treat the attention weights derived from each CCM as pseudo targets for the attention weights derived from the other CCM, maintaining consistency between the predictions of the two CCMs in a mutual learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method, on which we achieve state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.
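
The abstract gives no formulas, so the following is only a rough NumPy sketch of the two ideas it names: channel-wise re-calibration of a main modality using its own global context plus local information from the auxiliary modality, and a mutual-learning term between the two attention sequences. Every name here (`channel_recalibrate`, `temporal_attention`, `mutual_learning_loss`, the weight matrices, the sigmoid gate, the squared-error loss) is an assumption made for illustration and may differ from the authors' actual design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_recalibrate(main, aux, w_global, w_local):
    """Gate the main-modality features channel by channel.

    main, aux: (T, D) snippet features of the main and auxiliary modality.
    w_global, w_local: (D, D) projection matrices (hypothetical parameters).
    """
    # Global information of the main modality: temporal average, then projection.
    global_ctx = main.mean(axis=0) @ w_global            # shape (D,)
    # Cross-modal local information: per-snippet projection of the auxiliary modality.
    local_ctx = aux @ w_local                             # shape (T, D)
    # Assumed fusion: element-wise product squashed to a (0, 1) channel gate.
    gate = sigmoid(global_ctx[None, :] * local_ctx)       # shape (T, D)
    return gate * main

def temporal_attention(feats, w_att):
    """One attention weight in (0, 1) per snippet; w_att has shape (D,)."""
    return sigmoid(feats @ w_att)

def mutual_learning_loss(att_a, att_b):
    """Mean squared gap between the two attention sequences.

    In an autograd framework each side would be held constant when it serves
    as the pseudo target of the other; plain NumPy has no gradients, so the
    two directions reduce to the same number here.
    """
    return float(np.mean((att_a - att_b) ** 2))

# Toy input: 6 snippets, 4 channels per modality (random stand-ins for RGB/flow features).
rng = np.random.default_rng(0)
T, D = 6, 4
rgb, flow = rng.normal(size=(T, D)), rng.normal(size=(T, D))
w_g, w_l = rng.normal(size=(D, D)), rng.normal(size=(D, D))
w_att = rng.normal(size=D)

rgb_hat = channel_recalibrate(rgb, flow, w_g, w_l)    # RGB as main, flow as auxiliary
flow_hat = channel_recalibrate(flow, rgb, w_g, w_l)   # roles swapped
att_rgb = temporal_attention(rgb_hat, w_att)
att_flow = temporal_attention(flow_hat, w_att)
loss = mutual_learning_loss(att_rgb, att_flow)
```

For brevity the sketch reuses the same weights for both modalities; the abstract describes the two modules only as identical, which need not mean they share parameters.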

Cite

Hong, F. T., Feng, J. C., Xu, D., Shan, Y., & Zheng, W. S. (2021). Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 1591–1599). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3475298
