Learning Attentive Representations for Environmental Sound Classification

Abstract

Environmental sound classification (ESC) is a challenging problem because environmental sounds have a complex temporal structure and diverse energy modulation patterns. To deal with the former, temporal attention mechanisms have been adopted to focus on the informative frames. However, no existing work addresses the latter. In this paper, we consider the role of convolution filters in detecting energy modulation patterns and propose a channel attention mechanism that focuses on the semantically relevant channels generated by the corresponding filters. Furthermore, we combine temporal attention and channel attention so that they provide complementary information and enhance the representational power of the CNN. In addition, to avoid possible overfitting caused by limited training data, we explore a data augmentation scheme, which is another contribution of this paper. We evaluate the proposed method on three benchmark ESC datasets: ESC-10, ESC-50, and DCASE2016. Experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art or competitive classification accuracy. Finally, we visualize the attention results and observe that the proposed attention mechanism leads the network to focus on the semantically relevant parts of environmental sounds.
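
To make the idea of combining channel and temporal attention concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a CNN feature map of shape (channels, time frames, frequency bins), and it assumes that each attention vector comes from a simple pooled descriptor, a linear scoring step, and a softmax; the abstract does not specify these details. The names `channel_attention`, `temporal_attention`, `attend`, `w`, and `v` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def channel_attention(feat, w):
    # feat: (C, T, F) feature map; w: (C, C) hypothetical learned projection.
    pooled = feat.mean(axis=(1, 2))      # (C,) one descriptor per channel
    return softmax(w @ pooled)           # (C,) weights over channels

def temporal_attention(feat, v):
    # v: (C,) hypothetical learned scoring vector.
    frames = feat.mean(axis=2)           # (C, T) pool over frequency bins
    scores = v @ frames                  # (T,) one score per time frame
    return softmax(scores)               # (T,) weights over frames

def attend(feat, w, v):
    # Rescale the feature map by both attention vectors (assumed combination).
    a_c = channel_attention(feat, w)
    a_t = temporal_attention(feat, v)
    out = feat * a_c[:, None, None] * a_t[None, :, None]
    return out, a_c, a_t

# Toy input: 8 channels, 20 time frames, 16 frequency bins.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 20, 16))
w = rng.standard_normal((8, 8)) * 0.1
v = rng.standard_normal(8) * 0.1
out, a_c, a_t = attend(feat, w, v)
```

In this sketch the two attention vectors are applied multiplicatively, which is one possible way to combine them; in a trained network `w` and `v` would be learned rather than random.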

Zhang, Z., Xu, S., Zhang, S., Qiao, T., & Cao, S. (2019). Learning Attentive Representations for Environmental Sound Classification. IEEE Access, 7, 130327–130339. https://doi.org/10.1109/ACCESS.2019.2939495
