Automatic recognition of human emotions is of high importance in human-computer interaction (HCI) due to its applications in real-world tasks. Several previous studies have addressed emotion recognition using a variety of sensors, feature extraction methods, and classification techniques; in particular, emotion recognition has been reported using audio, vision, text, and biosensors. Although significant improvements have been achieved on acted emotional speech, performance remains low due to the lack of real data and the limited size of available datasets. To address this problem, this study investigates data augmentation based on Generative Adversarial Networks (GANs). For classification, the Vision Transformer (ViT) is used; ViT was originally proposed for image classification and is adapted here to emotion recognition. The proposed methods were evaluated on the English IEMOCAP and the Japanese JTES speech corpora and showed significant improvements when data augmentation was applied.
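As a rough illustration of how a ViT, designed for images, can be adapted to speech (this is a hedged sketch, not the authors' code), the usual approach is to treat a 2-D spectrogram like an image: it is cut into fixed-size patches, each flattened into a token vector for the transformer. The patch size and spectrogram shape below are illustrative assumptions:

```python
import numpy as np

def patchify(spectrogram, patch_size=16):
    """Split a 2-D spectrogram (freq x time) into non-overlapping
    flattened patches, as ViT does for images. Shapes are illustrative;
    the patch size here is an assumption, not taken from the paper."""
    f, t = spectrogram.shape
    # Trim so both axes divide evenly into patches.
    f_trim, t_trim = f - f % patch_size, t - t % patch_size
    spec = spectrogram[:f_trim, :t_trim]
    patches = (spec
               .reshape(f_trim // patch_size, patch_size,
                        t_trim // patch_size, patch_size)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch_size * patch_size))
    return patches  # (num_patches, patch_dim): the token sequence fed to ViT

# Example: a 128 x 256 mel-spectrogram yields 8 * 16 = 128 tokens of dim 256.
spec = np.random.randn(128, 256)
tokens = patchify(spec)
print(tokens.shape)  # (128, 256)
```

Each row of `tokens` would then be linearly projected to the model dimension and combined with positional embeddings, exactly as in image ViT.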
CITATION STYLE
Heracleous, P., Fukayama, S., Ogata, J., & Mohammad, Y. (2022). Applying Generative Adversarial Networks and Vision Transformers in Speech Emotion Recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13519 LNCS, pp. 67–75). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-17618-0_6