Exploiting spectro-temporal locality in deep learning based acoustic event detection

Miquel Espi; Masakiyo Fujimoto; Keisuke Kinoshita; Tomohiro Nakatani

Journal Article (Open Access)

Eurasip Journal on Audio, Speech, and Music Processing (2015) 2015(1)

DOI: 10.1186/s13636-015-0069-2

Abstract

In recent years, deep learning has permeated not only computer vision and speech recognition research but also fields such as acoustic event detection (AED). One of the aims of AED is to detect and classify non-speech acoustic events occurring in conversation scenes, including those produced by both humans and the objects that surround us. In AED, deep learning has enabled modeling of detail-rich features, and among these, high-resolution spectrograms have shown a significant advantage over existing predefined features (e.g., Mel-filter bank) that compress and reduce detail. In this paper, we further assess the importance of feature extraction for deep learning-based acoustic event detection. AED based on spectrogram-input deep neural networks exploits the fact that sounds have “global” spectral patterns, but sounds also have “local” properties, such as being more transient or smoother in the time-frequency domain. These can be exposed by adjusting the time-frequency resolution used to compute the spectrogram, or by using a model that exploits locality. This leads us to explore two different feature extraction strategies in the context of deep learning: (1) using multiple resolution spectrograms simultaneously and analyzing the overall and event-wise influence to combine the results, and (2) introducing the use of convolutional neural networks (CNN), a state-of-the-art 2D feature extraction model that exploits local structures, with log power spectrogram input for AED. An experimental evaluation shows that the approaches we describe outperform our state-of-the-art deep learning baseline, with a noticeable gain in the CNN case, and provides insights regarding CNN-based spectrogram characterization for AED.
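
As a rough illustration of the first strategy, the sketch below computes log power spectrograms of one signal at two time-frequency resolutions using plain NumPy. It is not the authors' implementation: the function name `log_power_spectrogram`, the Hann window, the 50% hop, the window lengths (256 and 1024 samples), and the synthetic test signal are all assumptions chosen for illustration. A short window gives finer time resolution, which suits transient events, while a long window gives finer frequency resolution, which suits smoother, tonal events.

```python
import numpy as np

def log_power_spectrogram(signal, win_len, hop_len, eps=1e-10):
    """Return a (frames, win_len // 2 + 1) log power spectrogram.

    Hypothetical helper for illustration; the window type and eps
    are assumptions, not values taken from the paper.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # Power of the one-sided FFT of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # eps avoids log(0) on silent bins.
    return np.log(power + eps)

# Synthetic 1 s signal at 16 kHz: a steady 1 kHz tone plus a 1 ms click.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 1000 * t)
signal[8000:8016] += 1.0

# Illustrative window lengths in samples (assumed, not from the paper).
resolutions = {"short": 256, "long": 1024}
specs = {
    name: log_power_spectrogram(signal, n, n // 2)
    for name, n in resolutions.items()
}

for name, spec in specs.items():
    print(name, spec.shape)
```

Each resolution yields a matrix with a different number of frames and frequency bins, so a system that uses several of them would need either one model per resolution with a later combination of results, or some alignment step. The abstract describes combining results but does not give the details here.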

Cite (APA)

Espi, M., Fujimoto, M., Kinoshita, K., & Nakatani, T. (2015). Exploiting spectro-temporal locality in deep learning based acoustic event detection. Eurasip Journal on Audio, Speech, and Music Processing, 2015(1). https://doi.org/10.1186/s13636-015-0069-2
