Abstract
In this work, we address the Next Speaker Prediction sub-challenge of the ACM Multimedia 2021 MultiMediate Grand Challenge. This challenge poses the problem of turn-taking prediction in physically situated multiparty interaction. Solving this problem is essential for enabling fluent real-time multiparty human-machine interaction. The problem is made more difficult by the need for a robust solution that performs effectively across a wide variety of settings and contexts. Prior work has shown that current state-of-the-art methods rely on machine learning approaches that do not generalize well to new settings and feature distributions. To address this problem, we propose using group-level focus of visual attention as additional information. We show that a simple combination of group-level focus of visual attention features and publicly available audio-video synchronizer models is competitive with state-of-the-art methods fine-tuned on the challenge dataset.
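The fusion described above can be illustrated with a minimal sketch. All function names, the linear weighting, and the score ranges here are assumptions for illustration; the abstract does not specify the authors' exact combination rule.

```python
# Illustrative sketch (not the authors' exact method): ranking candidate
# next speakers by fusing a group-level visual attention score with an
# audio-video sync score. The linear fusion and 0-1 score ranges are
# assumptions made for this example.

def attention_score(gaze_targets, candidate):
    """Fraction of the other group members currently looking at `candidate`."""
    others = [target for person, target in gaze_targets.items()
              if person != candidate]
    if not others:
        return 0.0
    return sum(target == candidate for target in others) / len(others)

def predict_next_speaker(gaze_targets, sync_scores, weight=0.5):
    """Pick the candidate maximizing a weighted sum of attention and sync scores.

    gaze_targets: dict mapping each participant to who they are looking at.
    sync_scores: dict mapping each participant to an audio-video sync
                 confidence in [0, 1] (e.g. from an off-the-shelf
                 active-speaker / sync model).
    """
    fused = {
        c: weight * attention_score(gaze_targets, c)
           + (1 - weight) * sync_scores[c]
        for c in sync_scores
    }
    return max(fused, key=fused.get)

# Example: B receives the group's gaze and has the highest sync score,
# so B is predicted as the next speaker.
gaze = {"A": "B", "B": "C", "C": "B"}
sync = {"A": 0.1, "B": 0.8, "C": 0.3}
print(predict_next_speaker(gaze, sync))  # -> B
```

The point of the sketch is that the visual attention signal requires no dataset-specific training: it is computed directly from who is looking at whom, which is one plausible reason such features transfer across settings better than fine-tuned models.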
Birmingham, C., Stefanov, K., & Mataric, M. J. (2021). Group-Level Focus of Visual Attention for Improved Next Speaker Prediction. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 4838–4842). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3479213