Recently, multimodal emotion recognition has received increasing interest due to its potential to improve performance by leveraging complementary sources of information. In this work, we explore the use of images, texts and tags for emotion recognition. However, using several modalities also introduces an additional challenge that is often ignored, namely the problem of the "missing modality": social media users do not always publish content containing an image, text and tags, and consequently one or two modalities are often missing at test time. Similarly, labeled training data that contain all modalities can be limited. Taking this into consideration, we propose a multimodal model that leverages a multitask framework to enable training on data composed of an arbitrary number of modalities, while also making predictions when modalities are missing. We show that our approach is robust to one or two missing modalities at test time. Moreover, this framework makes it easy to fine-tune parts of our model with unimodal and bimodal training data, which can further improve overall performance. Finally, our experiments indicate that this multitask learning also acts as a regularization mechanism that improves generalization.
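To make the idea concrete, below is a minimal sketch of how a multitask model over modality subsets could be organized: one encoder per modality and one classification head per non-empty combination of modalities, so that a training example or a test example with a missing image, text or tags still maps to a valid task. The modality names, feature dimensions, mean-pooling fusion and per-subset heads shown here are illustrative assumptions, not the exact architecture described in the paper.

```python
# Illustrative sketch only: dimensions, fusion by averaging, and per-subset
# heads are assumptions for exposition, not the authors' exact design.
import torch
import torch.nn as nn


class MultimodalMultitaskModel(nn.Module):
    """One encoder per modality, one classifier head per modality subset."""

    def __init__(self, dims, hidden=256, num_classes=8):
        super().__init__()
        # dims: dict mapping modality name -> input feature dimension
        self.encoders = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
            for m, d in dims.items()
        })
        self.modalities = sorted(dims)
        # Enumerate every non-empty subset of modalities as a separate task.
        subsets = []
        for mask in range(1, 2 ** len(self.modalities)):
            subsets.append(tuple(
                m for i, m in enumerate(self.modalities) if (mask >> i) & 1
            ))
        self.heads = nn.ModuleDict({
            "+".join(s): nn.Linear(hidden, num_classes) for s in subsets
        })

    def forward(self, inputs):
        # inputs: dict containing only the modalities present for this batch
        available = tuple(m for m in self.modalities if m in inputs)
        encoded = [self.encoders[m](inputs[m]) for m in available]
        fused = torch.stack(encoded, dim=0).mean(dim=0)
        return self.heads["+".join(available)](fused)


# Example: a batch where only the image and text modalities are available.
model = MultimodalMultitaskModel({"image": 2048, "text": 768, "tags": 300})
batch = {"image": torch.randn(4, 2048), "text": torch.randn(4, 768)}
logits = model(batch)                                  # shape: (4, num_classes)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 8, (4,)))
loss.backward()   # only the encoders and head used by this subset get gradients
```

In a sketch like this, unimodal or bimodal examples update the shared encoders through their own subset heads, which is one way to realize both the fine-tuning with partial data and the regularization effect mentioned in the abstract.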