Large scale audiovisual learning of sounds with weakly labeled data

Haytham M. Fayek; Anurag Kumar

Conference Proceedings, Open Access

IJCAI International Joint Conference on Artificial Intelligence (2020) 2021-January 558-565

DOI: 10.24963/ijcai.2020/78

Abstract

Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we argue that sound recognition is inherently a multi-modal audiovisual task: it is easier to differentiate sounds using both the audio and visual modalities than using either one alone. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model uses an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound events dataset AudioSet demonstrate the efficacy of the proposed model, which outperforms the single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP (relative: 10.4%).
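
The abstract describes the fusion step only at a high level, so the following is a minimal sketch of one way attention can combine two modality outputs, not the authors' architecture. It assumes each modality emits per-class probabilities and that a per-class attention score is available for each modality; the scores are hand-set here, whereas a real model would learn to produce them. The names `attention_fuse`, `audio_scores`, and `visual_scores` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(audio_probs, visual_probs, audio_scores, visual_scores):
    """Fuse per-class outputs of two modalities (illustrative assumption).

    For each class, the two attention scores are normalized with a softmax
    across modalities, and the resulting weights form a convex combination
    of the audio and visual predictions.
    """
    fused = []
    for pa, pv, sa, sv in zip(audio_probs, visual_probs,
                              audio_scores, visual_scores):
        wa, wv = softmax([sa, sv])  # weights sum to 1 for this class
        fused.append(wa * pa + wv * pv)
    return fused


# Hypothetical values for three sound classes.
audio_probs = [0.9, 0.2, 0.4]
visual_probs = [0.3, 0.8, 0.4]
audio_scores = [2.0, -1.0, 0.0]   # hand-set; a trained model would predict these
visual_scores = [0.0, 1.0, 0.0]

fused = attention_fuse(audio_probs, visual_probs, audio_scores, visual_scores)
print(fused)
```

Because the weights are a softmax across modalities, each fused value lies between the audio and visual predictions for that class, and the weighting can differ from class to class, which is the sense in which the combination is dynamic.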

Fayek, H. M., & Kumar, A. (2020). Large scale audiovisual learning of sounds with weakly labeled data. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2021-January, pp. 558–565). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2020/78
