Time-frequency deep representation learning for speech emotion recognition integrating self-attention

Abstract

Learning efficient deep representations from spectrograms for speech emotion recognition remains a significant challenge. Existing deep-learning-based spectrogram feature extraction methods have achieved considerable success, but they ignore the distinct variation patterns that the spectrogram exhibits along the time and frequency axes. In this paper, a speech emotion recognition method integrating self-attention is proposed that considers both the joint and the axis-specific variation of time and frequency. A time-frequency convolutional neural network (TFCNN) is first proposed to learn deep representations from the spectrogram. A multi-head self-attention layer, inspired by Google's Transformer, is then introduced to fuse these deep representations more effectively. Finally, an extreme learning machine (ELM) and a bidirectional long short-term memory (BLSTM) network are adopted as emotion classifiers. Experiments on the IEMOCAP dataset demonstrate the effectiveness of the proposed method, yielding clearer visualizations and better classification results.
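The abstract does not include code, and the paper's exact configuration is not given here. The PyTorch sketch below illustrates one plausible reading of the described pipeline: parallel frequency-wise and time-wise convolutions over a spectrogram, fusion via multi-head self-attention, and a classifier head. The kernel shapes, channel counts, frequency pooling, class count (four IEMOCAP emotions), and the plain linear head (standing in for the ELM/BLSTM classifiers) are all illustrative assumptions, not the authors' published settings.

```python
import torch
import torch.nn as nn

class TFAttentionSketch(nn.Module):
    """Hypothetical sketch of a time-frequency CNN with self-attention fusion.

    Kernel sizes, channel counts, and pooling are assumptions for
    illustration; only the overall structure follows the abstract.
    """

    def __init__(self, channels: int = 32, n_heads: int = 4, n_classes: int = 4):
        super().__init__()
        # Frequency-wise branch: tall, narrow kernels scan spectral changes.
        self.freq_conv = nn.Conv2d(1, channels, kernel_size=(9, 1), padding=(4, 0))
        # Time-wise branch: short, wide kernels scan temporal changes.
        self.time_conv = nn.Conv2d(1, channels, kernel_size=(1, 9), padding=(0, 4))
        self.relu = nn.ReLU()
        # Multi-head self-attention fuses the concatenated branch outputs.
        self.attn = nn.MultiheadAttention(2 * channels, n_heads, batch_first=True)
        # Simple linear head; the paper uses ELM / BLSTM classifiers instead.
        self.classifier = nn.Linear(2 * channels, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_freq_bins, n_frames)
        f = self.relu(self.freq_conv(spec)).mean(dim=2)  # (batch, C, n_frames)
        t = self.relu(self.time_conv(spec)).mean(dim=2)  # (batch, C, n_frames)
        x = torch.cat([f, t], dim=1).transpose(1, 2)     # (batch, n_frames, 2C)
        fused, _ = self.attn(x, x, x)                    # self-attention fusion
        return self.classifier(fused.mean(dim=1))        # utterance-level logits


# Usage: two utterances as 128-bin spectrograms with 300 frames each.
logits = TFAttentionSketch()(torch.randn(2, 1, 128, 300))  # -> shape (2, 4)
```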

Citation (APA)

Liu, J., Liu, Z., Wang, L., Guo, L., & Dang, J. (2019). Time-frequency deep representation learning for speech emotion recognition integrating self-attention. In Communications in Computer and Information Science (Vol. 1142 CCIS, pp. 681–689). Springer. https://doi.org/10.1007/978-3-030-36808-1_74
