Learning efficient deep representations from spectrograms for speech emotion recognition remains a significant challenge. Most existing deep-learning-based spectrogram feature extraction methods have achieved considerable success, but they ignore the distinct patterns of change along the time and frequency axes of the spectrogram. In this paper, we propose a speech emotion recognition method integrating self-attention that accounts for both the interaction between time and frequency and their respective changing information. We first propose a time-frequency convolutional neural network (TFCNN) to learn deep representations from the spectrogram. We then introduce a multi-head self-attention layer, inspired by Google's Transformer, to fuse these deep representations more efficiently. Finally, extreme learning machine (ELM) and bidirectional long short-term memory (BLSTM) models are adopted as emotion classifiers. Experiments on the IEMOCAP dataset demonstrate the effectiveness of the proposed method, yielding better visualizations and classification results.
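The fusion step described above can be sketched as a standard multi-head self-attention pass over a sequence of frame-level deep features. This is a minimal numpy illustration of the mechanism, not the paper's implementation: the sequence length, feature dimension, head count, and random weight initialization below are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention over a sequence X of shape
    (seq_len, d_model), e.g. TFCNN features per spectrogram frame."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (seq_len, d_model)

    def split(M):
        # Reshape to (n_heads, seq_len, d_head) so heads attend independently.
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, L, L)
    heads = softmax(scores, axis=-1) @ Vh                  # (heads, L, d_head)
    # Concatenate heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Hypothetical sizes: 10 frames, 16-dim features, 4 attention heads.
rng = np.random.default_rng(0)
L, d, h = 10, 16, 4
X = rng.standard_normal((L, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=h)
print(out.shape)  # (10, 16)
```

The attended output keeps the input's sequence shape, so it can be passed directly to a downstream classifier such as the BLSTM or, after pooling, the ELM mentioned in the abstract.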
Liu, J., Liu, Z., Wang, L., Guo, L., & Dang, J. (2019). Time-frequency deep representation learning for speech emotion recognition integrating self-attention. In Communications in Computer and Information Science (Vol. 1142 CCIS, pp. 681–689). Springer. https://doi.org/10.1007/978-3-030-36808-1_74