Automatic recognition of human emotions is of high importance in human-computer interaction (HCI) due to its applications in real-world tasks. Several previous studies have addressed emotion recognition using a variety of sensors, feature extraction methods, and classification techniques; in particular, emotion recognition has been reported using audio, vision, text, and biosensors. Although significant improvements have been achieved on acted emotional speech, performance remains low due to the lack of real data and the limited size of available datasets. To address this problem, this study investigates data augmentation based on Generative Adversarial Networks (GANs). For classification, the Vision Transformer (ViT) is used; ViT was originally proposed for image classification and is adapted here to emotion recognition. The proposed methods were evaluated on the English IEMOCAP and the Japanese JTES speech corpora and showed significant improvements when data augmentation was applied.
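As a rough illustration of how a ViT, designed for images, can be adapted to speech (this is a hedged sketch, not the authors' code), the usual approach is to treat a 2-D spectrogram like an image: it is cut into fixed-size patches, each flattened into a token vector for the transformer. The patch size and spectrogram shape below are illustrative assumptions:

```python
import numpy as np

def patchify(spectrogram, patch_size=16):
    """Split a 2-D spectrogram (freq x time) into non-overlapping
    flattened patches, as ViT does for images. Shapes are illustrative;
    the patch size here is an assumption, not taken from the paper."""
    f, t = spectrogram.shape
    # Trim so both axes divide evenly into patches.
    f_trim, t_trim = f - f % patch_size, t - t % patch_size
    spec = spectrogram[:f_trim, :t_trim]
    patches = (spec
               .reshape(f_trim // patch_size, patch_size,
                        t_trim // patch_size, patch_size)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch_size * patch_size))
    return patches  # (num_patches, patch_dim): the token sequence fed to ViT

# Example: a 128 x 256 mel-spectrogram yields 8 * 16 = 128 tokens of dim 256.
spec = np.random.randn(128, 256)
tokens = patchify(spec)
print(tokens.shape)  # (128, 256)
```

Each row of `tokens` would then be linearly projected to the model dimension and combined with positional embeddings, exactly as in image ViT.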
CITATION STYLE
Heracleous, P., Fukayama, S., Ogata, J., & Mohammad, Y. (2022). Applying Generative Adversarial Networks and Vision Transformers in Speech Emotion Recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13519 LNCS, pp. 67–75). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-17618-0_6