In this paper, we propose a temporal graph convolutional network (TGCN) to recognize sentiment from the language (textual), acoustic, and visual (facial expression) modalities. TGCN constructs a modality-specific graph whose nodes are the aligned segments of the multimodal utterances and whose edges are weighted according to the distances between the segments' features, so that the learned node embeddings capture the sequential semantics underlying the utterances. In particular, we use positional encoding with interleaved sine and cosine embeddings to encode the positions of the segments within the utterances into their features. Given the modality-specific embeddings of the segments, we apply an attention mechanism over the segments to capture the sentiment-related ones and obtain unified utterance embeddings. We then fuse the attended embeddings of the multimodal utterances and apply attention again to capture their interactions. Finally, the fused embeddings are concatenated with the raw features for sentiment prediction. Extensive experiments on three publicly available datasets show that TGCN outperforms state-of-the-art methods.
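The following is a minimal sketch of the modality-specific graph construction and positional encoding described above. The helper names (sinusoidal_positional_encoding, build_segment_graph, gcn_layer), the Gaussian kernel used for the distance-based edge weights, and the standard symmetrically normalized GCN propagation step are assumptions for illustration; the abstract only states that edges are weighted by feature distances and that positions are encoded with interleaved sine and cosine embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(num_segments, dim):
    """Interleaved sine/cosine positional encoding over segment positions."""
    pe = np.zeros((num_segments, dim))
    positions = np.arange(num_segments)[:, None]                    # (T, 1)
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))   # (dim/2,)
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)
    return pe

def build_segment_graph(features, sigma=1.0):
    """Adjacency over aligned segments, weighted by pairwise feature distance.

    A Gaussian kernel over Euclidean distances is one plausible choice
    (assumption); closer segments receive larger edge weights.
    """
    diff = features[:, None, :] - features[None, :, :]               # (T, T, d)
    dist2 = np.sum(diff ** 2, axis=-1)                               # (T, T)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def gcn_layer(adj, h, weight):
    """One symmetrically normalized GCN propagation step with ReLU."""
    adj_hat = adj + np.eye(adj.shape[0])                             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(adj_hat.sum(axis=1))
    norm = adj_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ h @ weight, 0.0)

# Toy usage: 20 aligned segments of one modality with 32-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((20, 32))
feats = feats + sinusoidal_positional_encoding(20, 32)               # inject positions
adj = build_segment_graph(feats)
node_emb = gcn_layer(adj, feats, rng.standard_normal((32, 32)))
print(node_emb.shape)  # (20, 32) node embeddings for this modality
```

In the full model, such node embeddings would be produced per modality, pooled with segment-level attention into utterance embeddings, fused across modalities with a further attention step, and concatenated with the raw features for the final sentiment prediction.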
Huang, J., Lin, Z., Yang, Z., & Liu, W. (2021). Temporal Graph Convolutional Network for Multimodal Sentiment Analysis. In ICMI 2021 - Proceedings of the 2021 International Conference on Multimodal Interaction (pp. 239–247). Association for Computing Machinery, Inc. https://doi.org/10.1145/3462244.3479939