Speaker-adapted neural-network-based fusion for multimodal reference resolution

Diana Kleingarn; Martin Heckmann; Nima Nabizadeh; Dorothea Kolossa

Conference Proceedings, Open Access

SIGDIAL 2019 - 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue - Proceedings of the Conference (2019), pp. 210–214

DOI: 10.18653/v1/w19-5925

Abstract

Humans use a variety of approaches to reference objects in the external world, including verbal descriptions, hand and head gestures, eye gaze, or any combination of them. The amount of useful information from each modality, however, may vary depending on the specific person and on several other factors. For this reason, it is important to learn the correct combination of inputs for inferring the best-fitting reference. In this paper, we investigate speaker-dependent and speaker-independent fusion strategies in a multimodal reference resolution task. We show that an optimized fusion technique alone, without any change in the modality models, can reduce the error rate of the system on a reference resolution task by more than 50%.

Citation (APA)

Kleingarn, D., Heckmann, M., Nabizadeh, N., & Kolossa, D. (2019). Speaker-adapted neural-network-based fusion for multimodal reference resolution. In SIGDIAL 2019 - 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue - Proceedings of the Conference (pp. 210–214). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-5925
