Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Yan Bo Lin; Yu Chiang Frank Wang

Conference Proceedings, Open Access

35th AAAI Conference on Artificial Intelligence, AAAI 2021 (2021) 3A 2056-2063

DOI: 10.1609/aaai.v35i3.16302

Abstract

Humans perceive a rich auditory experience because each ear hears a distinct sound. Videos recorded with binaural audio, in particular, simulate how humans receive ambient sound. However, a large number of videos have monaural audio only, which degrades the user experience due to the lack of ambient information. To address this issue, we propose an audio spatialization framework that converts a monaural video into a binaural one by exploiting the relationship between audio and visual components. By preserving left-right consistency in both the audio and visual modalities, our learning strategy can be viewed as a self-supervised learning technique, and it alleviates the dependency on a large amount of video data with ground-truth binaural audio during training. Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios, with ablation studies and visualizations further supporting the use of our model for audio spatialization.
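The left-right consistency idea can be illustrated with a small sketch. The abstract does not spell out the loss, so the code below rests on an assumption: mirroring the video horizontally should swap the predicted left and right channels, and the mismatch can be penalized without any ground-truth binaural audio. The names `flip_horizontal`, `left_right_consistency_loss`, and `toy_spatialize` are hypothetical and are not taken from the paper; `toy_spatialize` is only a stand-in for a learned model.

```python
import numpy as np

def flip_horizontal(frames):
    """Mirror video frames left-to-right; frames has shape (T, H, W, C)."""
    return frames[:, :, ::-1, :]

def left_right_consistency_loss(spatialize, mono, frames):
    """Mean squared gap between the prediction on mirrored frames and the
    channel-swapped prediction on the original frames (assumed form of the
    constraint, not the paper's exact objective)."""
    left, right = spatialize(mono, frames)
    left_m, right_m = spatialize(mono, flip_horizontal(frames))
    # After mirroring, the new left channel should match the old right one,
    # and the new right channel should match the old left one.
    return float(np.mean((left_m - right) ** 2) + np.mean((right_m - left) ** 2))

def toy_spatialize(mono, frames):
    """Hypothetical stand-in for a learned model: pans the mono signal
    according to the horizontal centre of image brightness."""
    width = frames.shape[2]
    column_energy = frames.sum(axis=(0, 1, 3))   # brightness per column, shape (W,)
    positions = np.linspace(0.0, 1.0, width)     # 0 = far left, 1 = far right
    pan = float((column_energy * positions).sum() / column_energy.sum())
    return (1.0 - pan) * mono, pan * mono

# Usage: a bright region on the left side of every frame.
rng = np.random.default_rng(0)
mono = rng.standard_normal(1600)
frames = np.zeros((4, 8, 16, 3))
frames[:, :, :4, :] = 1.0
loss = left_right_consistency_loss(toy_spatialize, mono, frames)
```

Because the loss needs only the mono track and the frames, it can be computed on videos that have no binaural recording, which is one way to read the abstract's claim of reduced dependency on ground-truth binaural data.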

Cite

Lin, Y. B., & Wang, Y. C. F. (2021). Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation. In 35th AAAI Conference on Artificial Intelligence, AAAI 2021 (Vol. 3A, pp. 2056–2063). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v35i3.16302
