Abstract
Sound Event Detection (SED) aims to predict the temporal boundaries of all events of interest, together with their class labels, given an unconstrained audio sample. Whether adopting the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods treat SED as a discriminative learning problem. In this work, we reformulate SED from a generative learning perspective. Specifically, we generate sound event temporal boundaries from noisy proposals through a denoising diffusion process, conditioned on the target audio sample. During training, the model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions within a Transformer decoder framework. This enables the model to generate accurate event boundaries from even noisy queries at inference time. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with over 40% faster training convergence. Code: https://github.com/Surrey-UPLab/DiffSED.
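The reverse-diffusion idea described above can be illustrated with a minimal sketch: starting from a random (onset, offset) proposal, an iterative denoiser repeatedly moves the proposal toward a predicted clean boundary. The `predict_clean` callable below is a hypothetical stand-in for the paper's conditioned Transformer decoder; the step rule is a simplified interpolation, not the authors' actual sampler.

```python
import random

def denoise_step(noisy_boundary, predicted_clean, alpha=0.5):
    """One simplified reverse-diffusion step: move the noisy (onset, offset)
    proposal a fraction of the way toward the model's clean prediction."""
    on, off = noisy_boundary
    p_on, p_off = predicted_clean
    return (on + alpha * (p_on - on), off + alpha * (p_off - off))

def generate_boundary(predict_clean, steps=10, seed=0):
    """Start from a random proposal in [0, 1) and iteratively denoise it,
    mimicking (in toy form) the generative process described in the abstract.
    `predict_clean` stands in for the audio-conditioned denoising network."""
    rng = random.Random(seed)
    boundary = (rng.random(), rng.random())  # noisy latent query
    for t in range(steps):
        boundary = denoise_step(boundary, predict_clean(boundary, t))
    return boundary
```

With a fixed target, e.g. `generate_boundary(lambda b, t: (0.2, 0.7))`, the proposal converges to near (0.2, 0.7) regardless of its random starting point, which is the key property the paper exploits at inference: accurate boundaries emerge even from noisy queries.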
Citation
Bhosale, S., Nag, S., Kanojia, D., Deng, J., & Zhu, X. (2024). DiffSED: Sound Event Detection with Denoising Diffusion. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 792–800). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i2.27837