Crossmodal grounding is a key technical challenge when generating relevant and well-timed gestures from spoken language. Often, the same gesture can accompany semantically different spoken language phrases, which makes crossmodal grounding especially challenging. For example, a gesture (semi-circular with both hands) could co-occur with the semantically different phrases "entire bottom row" (referring to a physical point) and "molecules expand and decay" (referring to a scientific phenomenon). In this paper, we introduce a self-supervised approach to learn representations better suited to such many-to-one grounding relationships between spoken language and gestures. As part of this approach, we propose a new contrastive loss function, Crossmodal Cluster NCE, that guides the model to learn spoken language representations which are consistent with the similarities in the gesture space. This gesture-aware space can help us generate more relevant gestures given language as input. We demonstrate the effectiveness of our approach on a publicly available dataset through quantitative and qualitative evaluations. Our proposed methodology significantly outperforms prior approaches for gesture-language grounding. Link to code: https://github.com/dondongwon/CC_NCE_GENEA.
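To make the idea of a cluster-aware contrastive loss concrete, below is a minimal PyTorch sketch, not the paper's exact formulation. It assumes that gestures have already been clustered and that language-gesture pairs whose gestures fall in the same cluster are treated as additional positives for each language anchor, while all other gestures in the batch serve as negatives; the function name, the `temperature` parameter, and the `cluster_ids` input are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def crossmodal_cluster_nce(lang_emb, gest_emb, cluster_ids, temperature=0.1):
    """Sketch of a clustered crossmodal InfoNCE-style loss.

    lang_emb:    (B, D) language embeddings
    gest_emb:    (B, D) gesture embeddings
    cluster_ids: (B,)   gesture-space cluster assignment per sample
    """
    lang = F.normalize(lang_emb, dim=-1)
    gest = F.normalize(gest_emb, dim=-1)

    # Pairwise language-to-gesture similarities, scaled by temperature.
    logits = lang @ gest.t() / temperature  # (B, B)

    # Samples whose gestures share a cluster count as positives.
    pos_mask = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)  # (B, B)

    # Log-softmax over all gestures in the batch for each language anchor.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-likelihood over each anchor's positives, then negate.
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```

Compared with standard InfoNCE, which treats only the diagonal (paired) gesture as positive, this variant lets a language phrase be pulled toward every gesture in its cluster, reflecting the many-to-one relationship between phrases and gestures described above.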
Lee, D. W., Ahuja, C., & Morency, L. P. (2021). Crossmodal Clustered Contrastive Learning: Grounding of Spoken Language to Gesture. In ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction (pp. 202–210). Association for Computing Machinery, Inc. https://doi.org/10.1145/3461615.3485408