Emotion recognition in speech using cross-modal transfer in the wild

Citations: 210 · Mendeley readers: 234

Abstract

Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
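To make the cross-modal distillation idea described above concrete, the sketch below shows one possible training step: a frozen facial-emotion teacher produces soft emotion labels for face frames, and a speech student operating on the time-aligned audio is trained to match those labels, so no ground-truth audio annotations are needed. This is a minimal illustration under assumed PyTorch modules and hypothetical names (teacher, student, faces, spectrograms), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, faces, spectrograms, optimizer, T=2.0):
    """One cross-modal distillation step: transfer the teacher's emotion
    predictions on face frames to a student that sees only the speech segment."""
    with torch.no_grad():
        # Soft emotion labels from the frozen visual teacher.
        teacher_logits = teacher(faces)
        soft_targets = F.softmax(teacher_logits / T, dim=1)

    # Student predicts emotions from the time-aligned speech (e.g. spectrograms).
    student_logits = student(spectrograms)

    # KL divergence between the softened teacher and student distributions;
    # no labelled audio is used anywhere in this objective.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The temperature T and the exact distillation loss are illustrative choices; the key point is that supervision flows from the visual modality to the audio modality through paired face and voice tracks.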

Citation (APA)

Albanie, S., Nagrani, A., Vedaldi, A., & Zisserman, A. (2018). Emotion recognition in speech using cross-modal transfer in the wild. In MM 2018 - Proceedings of the 2018 ACM Multimedia Conference (pp. 292–301). Association for Computing Machinery. https://doi.org/10.1145/3240508.3240578
