Crossmodal Clustered Contrastive Learning: Grounding of Spoken Language to Gesture

Abstract

Crossmodal grounding is a key technical challenge when generating relevant and well-timed gestures from spoken language. Often, the same gesture can accompany semantically different spoken language phrases, which makes crossmodal grounding especially challenging. For example, a gesture (semi-circular with both hands) could co-occur with the semantically different phrases "entire bottom row" (referring to a physical location) and "molecules expand and decay" (referring to a scientific phenomenon). In this paper, we introduce a self-supervised approach to learn representations better suited to such many-to-one grounding relationships between spoken language and gestures. As part of this approach, we propose a new contrastive loss function, Crossmodal Cluster NCE, that guides the model to learn spoken language representations which are consistent with the similarities in the gesture space. This gesture-aware space can help us generate more relevant gestures given language as input. We demonstrate the effectiveness of our approach on a publicly available dataset through quantitative and qualitative evaluations. Our proposed methodology significantly outperforms prior approaches for gesture-language grounding. Link to code: https://github.com/dondongwon/CC_NCE_GENEA.
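The abstract describes the Crossmodal Cluster NCE loss only at a high level. As a rough illustration of the idea of cluster-aware contrastive learning, the sketch below shows one way such an objective could be written in PyTorch: gestures are grouped into clusters (e.g., by k-means over the gesture space), and any gesture in the same cluster as a language sample's paired gesture is treated as a positive, reflecting the many-to-one grounding relationship. The function name, the clustering step, and the temperature value are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def crossmodal_cluster_nce(lang_emb, gest_emb, cluster_ids, temperature=0.1):
        """Illustrative clustered InfoNCE-style loss (hypothetical sketch).

        lang_emb:    (B, D) language representations
        gest_emb:    (B, D) gesture representations
        cluster_ids: (B,)   cluster assignment of each paired gesture
        """
        # Normalize so dot products are cosine similarities.
        lang = F.normalize(lang_emb, dim=-1)
        gest = F.normalize(gest_emb, dim=-1)

        # Pairwise similarity between every language and gesture sample.
        logits = lang @ gest.t() / temperature  # (B, B)

        # Gesture j is a positive for language i when both gestures fall in
        # the same cluster (many-to-one grounding); other gestures are negatives.
        pos_mask = (cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)).float()

        # Softmax over all gestures, then average the log-probability of the
        # positives for each language sample.
        log_prob = F.log_softmax(logits, dim=1)
        loss = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
        return loss.mean()

With all cluster assignments distinct, this reduces to a standard InfoNCE loss over paired language-gesture samples; grouping gestures into clusters is what lets several different phrases share the same positive gesture target.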

Citation (APA)

Lee, D. W., Ahuja, C., & Morency, L. P. (2021). Crossmodal Clustered Contrastive Learning: Grounding of Spoken Language to Gesture. In ICMI 2021 Companion - Companion Publication of the 2021 International Conference on Multimodal Interaction (pp. 202–210). Association for Computing Machinery, Inc. https://doi.org/10.1145/3461615.3485408
