Audio–Visual Segmentation

Abstract

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a new method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
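The abstract only names the temporal pixel-wise audio-visual interaction module without specifying its architecture. As a rough illustration of the general idea (an audio embedding attending over the pixels of a visual feature map to guide segmentation), a minimal PyTorch sketch might look like the following. The module name, feature dimensions, and single-query attention design are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Hypothetical sketch: audio-conditioned attention over pixels.

    The audio embedding acts as a query; every spatial location of the
    visual feature map provides a key/value. The resulting attention map
    reweights the visual features so that likely sounding regions are
    emphasized before a segmentation head. Not the paper's exact module.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim)                 # audio -> query
        self.k = nn.Conv2d(dim, dim, kernel_size=1)  # pixel-wise keys
        self.v = nn.Conv2d(dim, dim, kernel_size=1)  # pixel-wise values
        self.scale = dim ** -0.5

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, C, H, W) frame features; audio: (B, C) clip-level features
        b, c, h, w = visual.shape
        q = self.q(audio).unsqueeze(1)                 # (B, 1, C)
        k = self.k(visual).flatten(2).transpose(1, 2)  # (B, H*W, C)
        v = self.v(visual).flatten(2).transpose(1, 2)  # (B, H*W, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, H*W)
        gate = attn.reshape(b, 1, h, w)                # per-pixel audio relevance
        # residual fusion: emphasize pixels that respond to the audio query
        return visual + v.transpose(1, 2).reshape(b, c, h, w) * gate


if __name__ == "__main__":
    fusion = AudioVisualFusion(dim=256)
    frame_feats = torch.randn(2, 256, 28, 28)  # e.g. backbone features per frame
    audio_feats = torch.randn(2, 256)          # e.g. pooled audio embedding
    out = fusion(frame_feats, audio_feats)
    print(out.shape)  # torch.Size([2, 256, 28, 28])
```

The regularization loss mentioned in the abstract is defined in the full paper; the sketch above only illustrates the audio-to-pixel guidance idea, not that loss.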

Cite

APA

Zhou, J., Wang, J., Zhang, J., Sun, W., Zhang, J., Birchfield, S., … Zhong, Y. (2022). Audio–Visual Segmentation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13697 LNCS, pp. 386–403). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-19836-6_22
