Training strategies to handle missing modalities for audio-visual expression recognition

84Citations
Citations of this article
27Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Automatic audio-visual expression recognition can play an important role in communication services such as tele-health, VOIP calls and human-machine interaction. Accuracy of audio-visual expression recognition could benefit from the interplay between the two modalities. However, most audio-visual expression recognition systems, trained in ideal conditions, fail to generalize in real world scenarios where either the audio or visual modality could be missing due to a number of reasons such as limited bandwidth, interactors' orientation, caller initiated muting. This paper studies the performance of a state-of-the art transformer when one of the modalities is missing. We conduct ablation studies to evaluate the model in the absence of either modality. Further, we propose a strategy to randomly ablate visual inputs during training at the clip or frame level to mimic real world scenarios. Results conducted on in-the-wild data, indicate significant generalization in proposed models trained on missing cues, with gains up to 17% for frame level ablations, showing that these training strategies cope better with the loss of input modalities.

Cite

CITATION STYLE

APA

Parthasarathy, S., & Sundaram, S. (2020). Training strategies to handle missing modalities for audio-visual expression recognition. In ICMI 2020 Companion - Companion Publication of the 2020 International Conference on Multimodal Interaction (pp. 400–404). Association for Computing Machinery, Inc. https://doi.org/10.1145/3395035.3425202

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free