In Affective Computing, a mathematical representation of emotions in the computer is desirable for emotionally interactive agents. This study aims to obtain a latent representation of emotions (an emotional space) common to multiple modalities, motivated by the fact that humans can recognize emotions from multiple modalities. We define the emotional space as the latent space of a multimodal DNN model and propose embedding emotional information into a hemi-hyperspherical space. Our proposed model fuses the emotional spaces of the individual modalities in an element-wise weighted-average fashion. We train the model by combining an emotion recognition task with a latent space unification task. The unification task minimizes the distance between the emotional spaces produced by different modalities for the same input, which encourages the modalities to share a similar latent space. Experiments on audio-visual data evaluate the robustness of emotion recognition against missing modalities. The results confirm that the proposed method, especially with low-dimensional hemi-hyperspherical representations, can acquire a shared representation of emotion across modalities.
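The fusion and unification ideas summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the projection onto a hemi-hypersphere is assumed here to be taking absolute values (restricting to one half of the sphere) followed by L2 normalization, and the function names (`embed_hemisphere`, `fuse`, `unification_loss`) are hypothetical.

```python
import numpy as np

def embed_hemisphere(x):
    # Hypothetical hemi-hyperspherical projection: absolute values keep
    # every coordinate non-negative (one half of the sphere), then
    # L2 normalization places the vector on the unit sphere.
    h = np.abs(x)
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

def fuse(z_audio, z_visual, w):
    # Element-wise weighted average of the two per-modality embeddings;
    # w is a weight (scalar or per-dimension) in [0, 1].
    return w * z_audio + (1.0 - w) * z_visual

def unification_loss(z_audio, z_visual):
    # Mean squared distance between the two modality embeddings of the
    # same input; minimizing it pulls the per-modality spaces together.
    return np.mean(np.sum((z_audio - z_visual) ** 2, axis=-1))
```

Because both modalities are driven toward the same region of the sphere, either embedding (or their fused average) can stand in when the other modality is missing, which is the robustness property the experiments evaluate.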
Harata, S., Sakuma, T., & Kato, S. (2022). Audio-Visual Shared Emotion Representation for Robust Emotion Recognition on Modality Missing Using Hemi-hyperspherical Embedding and Latent Space Unification. In Communications in Computer and Information Science (Vol. 1581 CCIS, pp. 137–143). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-06388-6_18