Abstract
Sound Event Detection (SED) aims to predict the temporal boundaries of all events of interest, together with their class labels, given an unconstrained audio sample. Whether adopting the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods treat SED as a discriminative learning problem. In this work, we reformulate SED from a generative learning perspective. Specifically, we generate sound event temporal boundaries from noisy proposals through a denoising diffusion process, conditioned on the target audio sample. During training, the model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions within a Transformer decoder framework. This enables the model to generate accurate event boundaries from even noisy queries at inference time. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with over 40% faster training convergence. Code: https://github.com/Surrey-UPLab/DiffSED.
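The reverse-diffusion idea described above can be illustrated with a minimal sketch: starting from a random (onset, offset) proposal, an iterative denoiser repeatedly moves the proposal toward a predicted clean boundary. The `predict_clean` callable below is a hypothetical stand-in for the paper's conditioned Transformer decoder; the step rule is a simplified interpolation, not the authors' actual sampler.

```python
import random

def denoise_step(noisy_boundary, predicted_clean, alpha=0.5):
    """One simplified reverse-diffusion step: move the noisy (onset, offset)
    proposal a fraction of the way toward the model's clean prediction."""
    on, off = noisy_boundary
    p_on, p_off = predicted_clean
    return (on + alpha * (p_on - on), off + alpha * (p_off - off))

def generate_boundary(predict_clean, steps=10, seed=0):
    """Start from a random proposal in [0, 1) and iteratively denoise it,
    mimicking (in toy form) the generative process described in the abstract.
    `predict_clean` stands in for the audio-conditioned denoising network."""
    rng = random.Random(seed)
    boundary = (rng.random(), rng.random())  # noisy latent query
    for t in range(steps):
        boundary = denoise_step(boundary, predict_clean(boundary, t))
    return boundary
```

With a fixed target, e.g. `generate_boundary(lambda b, t: (0.2, 0.7))`, the proposal converges to near (0.2, 0.7) regardless of its random starting point, which is the key property the paper exploits at inference: accurate boundaries emerge even from noisy queries.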
Citation
Bhosale, S., Nag, S., Kanojia, D., Deng, J., & Zhu, X. (2024). DiffSED: Sound Event Detection with Denoising Diffusion. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 792–800). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i2.27837