Applying Generative Adversarial Networks and Vision Transformers in Speech Emotion Recognition

Abstract

Automatic recognition of human emotions is of high importance in human-computer interaction (HCI) due to its applications in real-world tasks. Several previous studies have addressed emotion recognition using various sensors, feature extraction methods, and classification techniques; in particular, emotion recognition has been reported using audio, vision, text, and biosensors. Although significant improvements have been achieved using acted emotion signals, emotion recognition still suffers from low performance because of the scarcity of real data and limited data sizes. To address this problem, this study investigates data augmentation based on Generative Adversarial Networks (GANs). For classification, the Vision Transformer (ViT) is used; ViT was originally applied to image classification, but in the current study it is adapted for speech emotion recognition. The proposed methods were evaluated on the English IEMOCAP and the Japanese JTES speech corpora and showed significant improvements when data augmentation was applied.
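
To make the described pipeline concrete, the sketch below pairs a GAN that synthesizes additional spectrogram-like training examples with a small Vision Transformer that classifies spectrograms into emotion categories. It is a minimal PyTorch sketch under assumed settings: the feature dimensions, network sizes, and four-emotion label set are illustrative assumptions, not the authors' exact architecture or training procedure.

# Minimal PyTorch sketch (illustrative only): GAN-based spectrogram
# augmentation plus a small ViT classifier. Dimensions, layer sizes, and
# the four-emotion label set are assumptions, not the paper's settings.
import torch
import torch.nn as nn

NOISE_DIM = 100              # assumed latent size for the generator
N_MELS, N_FRAMES = 128, 128  # assumed log-mel spectrogram shape
N_EMOTIONS = 4               # e.g. angry / happy / neutral / sad (assumed)

class Generator(nn.Module):
    # Maps a noise vector to a synthetic spectrogram in [-1, 1]
    # (real spectrograms would be normalized to the same range).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, N_MELS * N_FRAMES), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, N_MELS, N_FRAMES)

class Discriminator(nn.Module):
    # Scores whether a spectrogram is real (1) or generated (0).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(N_MELS * N_FRAMES, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, x):
        return self.net(x)

def gan_step(gen, disc, real, g_opt, d_opt):
    # One adversarial update with the non-saturating GAN loss.
    bce = nn.BCEWithLogitsLoss()
    z = torch.randn(real.size(0), NOISE_DIM)
    fake = gen(z)
    # Discriminator update: push real toward 1, fake toward 0.
    d_loss = (bce(disc(real), torch.ones(real.size(0), 1))
              + bce(disc(fake.detach()), torch.zeros(real.size(0), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: make the discriminator call fakes real.
    g_loss = bce(disc(fake), torch.ones(real.size(0), 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

class SpectrogramViT(nn.Module):
    # Minimal ViT: split the spectrogram into patches, encode with a
    # transformer, and classify from a learned [CLS] token.
    def __init__(self, patch=16, dim=256, depth=4, heads=4):
        super().__init__()
        n_patches = (N_MELS // patch) * (N_FRAMES // patch)
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, N_EMOTIONS)

    def forward(self, x):  # x: (B, 1, N_MELS, N_FRAMES)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        return self.head(self.encoder(x)[:, 0])

In practice, synthetic spectrograms sampled from the trained generator would be mixed into the real training set before fitting the classifier, which is the augmentation effect the abstract reports as improving performance.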

Citation (APA)

Heracleous, P., Fukayama, S., Ogata, J., & Mohammad, Y. (2022). Applying Generative Adversarial Networks and Vision Transformers in Speech Emotion Recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13519 LNCS, pp. 67–75). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-17618-0_6
