A visually-grounded first-person dialogue dataset with verbal and non-verbal responses

Abstract

In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial for understanding their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents' verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and that producing non-verbal responses is as challenging as producing verbal responses. Our dataset is publicly available.
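To make the four annotation types concrete, the sketch below shows one hypothetical example record pairing a first-person image, a speaker utterance, a gaze location, and the agent's verbal and non-verbal responses. The field names, types, and sample values are assumptions for illustration only and do not reflect the released dataset's actual schema or files.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical sketch of a single VFD-style example.
# Field names and values are illustrative assumptions, not the dataset's real schema.
@dataclass
class VFDExample:
    image_path: str               # first-person image seen by the agent
    utterance: str                # the human speaker's utterance
    gaze_xy: Tuple[float, float]  # speaker's eye-gaze location in the image (normalized coords)
    verbal_response: str          # agent's verbal response
    nonverbal_response: str       # agent's non-verbal response (e.g., an action label)

# Example usage with made-up values.
example = VFDExample(
    image_path="frames/000123.jpg",
    utterance="Could you pass me that cup?",
    gaze_xy=(0.62, 0.41),
    verbal_response="Sure, here you go.",
    nonverbal_response="hand over the cup",
)
print(example.utterance, "->", example.nonverbal_response)
```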

Citation (APA)

Kamezawa, H., Nishida, N., Shimizu, N., Miyazaki, T., & Nakayama, H. (2020). A visually-grounded first-person dialogue dataset with verbal and non-verbal responses. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020) (pp. 3299–3310). Association for Computational Linguistics (ACL). https://doi.org/10.5715/jnlp.28.259
