ViPER: Video-based Perceiver for Emotion Recognition


Abstract

Recognizing human emotions from videos requires a deep understanding of the underlying multimodal sources, including images, audio, and text. Since the input data sources vary widely across different modality combinations, leveraging multiple modalities often requires ad hoc fusion networks. To predict the emotional arousal of a person reacting to a given video clip, we present ViPER, a multimodal architecture leveraging a modality-agnostic transformer-based model to combine video frames, audio recordings, and textual annotations. Specifically, it relies on a modality-agnostic late fusion network, which makes ViPER easily adaptable to different modalities. The experiments carried out on the Hume-Reaction dataset of the MuSe-Reaction challenge confirm the effectiveness of the proposed approach.
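The abstract's key idea, modality-agnostic late fusion, can be sketched as follows: each modality is encoded independently into a shared embedding space, and the embeddings are combined only at the end, so modalities can be added or dropped without changing the fusion step. This is a minimal illustrative sketch, not the paper's actual architecture; the dimensions, random projections, and sigmoid arousal head are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Project one modality's features into a shared embedding space."""
    return features @ proj

def late_fusion(embeddings: list) -> np.ndarray:
    """Modality-agnostic late fusion: average the per-modality embeddings.
    Averaging works for any subset of modalities, which is what makes the
    fusion step agnostic to the input combination."""
    return np.mean(np.stack(embeddings), axis=0)

# Hypothetical per-modality feature sizes, all projected to a shared 16-dim space.
dims = {"video": 32, "audio": 24, "text": 48}
shared = 16
projections = {m: rng.normal(size=(d, shared)) for m, d in dims.items()}

# Dummy per-modality features standing in for real encoder outputs.
features = {m: rng.normal(size=(d,)) for m, d in dims.items()}
fused = late_fusion([encode(features[m], projections[m]) for m in dims])

# A simple linear head squashed to [0, 1] stands in for an arousal predictor.
head = rng.normal(size=(shared,))
arousal = 1.0 / (1.0 + np.exp(-(fused @ head)))
print(fused.shape, float(arousal))
```

Because fusion happens after encoding, dropping a modality only shortens the list passed to `late_fusion`; no network weights need to change.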

Citation (APA)
Vaiani, L., La Quatra, M., Cagliero, L., & Garza, P. (2022). ViPER: Video-based Perceiver for Emotion Recognition. In MuSe 2022 - Proceedings of the 3rd International Multimodal Sentiment Analysis Workshop and Challenge (pp. 67–73). Association for Computing Machinery, Inc. https://doi.org/10.1145/3551876.3554806
