Multimodal Embodied Plan Prediction Augmented with Synthetic Embodied Dialogue

Citations: 4 · Mendeley readers: 12

Abstract

Embodied task completion is a challenge where an agent in a simulated environment must predict environment actions to complete tasks based on natural language instructions and egocentric visual observations. We propose a variant of this problem where the agent predicts actions at a higher level of abstraction called a plan, which helps make agent actions more interpretable and can be obtained from the appropriate prompting of large language models. We show that multimodal transformer models can outperform language-only models for this problem but fall significantly short of oracle plans. Since collecting human-human dialogues for embodied environments is expensive and time-consuming, we propose a method to synthetically generate such dialogues, which we then use as training data for plan prediction. We demonstrate that multimodal transformer models can attain strong zero-shot performance from our synthetic data, outperforming language-only models trained on human-human data.
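The abstract notes that plans can be obtained by appropriately prompting large language models. The sketch below illustrates one hypothetical way to do this: format a short embodied dialogue into a prompt, query a language model, and parse the response into (action, object) plan steps. The prompt wording, action names, and the stub `call_llm` function are illustrative assumptions, not the paper's actual prompting setup.

```python
from typing import List, Tuple


def build_plan_prompt(dialogue: List[Tuple[str, str]]) -> str:
    """Format an embodied dialogue into a prompt that asks for a high-level plan."""
    turns = "\n".join(f"<{speaker}> {utterance}" for speaker, utterance in dialogue)
    return (
        "Given the following dialogue between a Commander and a Follower in a "
        "household environment,\nlist the plan as numbered steps of the form "
        "'Action(Object)'.\n\n"
        f"{turns}\n\nPlan:\n"
    )


def parse_plan(model_output: str) -> List[Tuple[str, str]]:
    """Parse lines such as '1. Pickup(Mug)' into (action, object) tuples."""
    steps = []
    for line in model_output.strip().splitlines():
        step = line.split(".", 1)[-1].strip()  # drop the leading step index
        if "(" in step and step.endswith(")"):
            action, obj = step[:-1].split("(", 1)
            steps.append((action.strip(), obj.strip()))
    return steps


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned plan for illustration only."""
    return "1. Pickup(Mug)\n2. Place(Mug, CoffeeMachine)\n3. ToggleOn(CoffeeMachine)"


if __name__ == "__main__":
    dialogue = [
        ("Commander", "Please make me a coffee."),
        ("Follower", "Where is the mug?"),
        ("Commander", "It is on the dining table."),
    ]
    plan = parse_plan(call_llm(build_plan_prompt(dialogue)))
    print(plan)  # [('Pickup', 'Mug'), ('Place', 'Mug, CoffeeMachine'), ('ToggleOn', 'CoffeeMachine')]
```

In the paper's setting, such plan steps would serve as the higher-level prediction targets; the multimodal models described in the abstract predict them from dialogue and egocentric observations rather than from a text-only prompt.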

Cite (APA)

Padmakumar, A., Inan, M., Gella, S., Lange, P. L., & Hakkani-Tur, D. (2023). Multimodal Embodied Plan Prediction Augmented with Synthetic Embodied Dialogue. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 6114–6131). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.374
