Abstract
Embodied task completion is a challenge in which an agent in a simulated environment must predict environment actions to complete tasks based on natural language instructions and egocentric visual observations. We propose a variant of this problem in which the agent predicts actions at a higher level of abstraction, called a plan, which makes agent behavior more interpretable and can be obtained by appropriately prompting large language models. We show that multimodal transformer models outperform language-only models on this problem but still fall significantly short of performance with oracle plans. Since collecting human-human dialogues for embodied environments is expensive and time-consuming, we propose a method to synthetically generate such dialogues, which we then use as training data for plan prediction. We demonstrate that multimodal transformer models attain strong zero-shot performance from our synthetic data, outperforming language-only models trained on human-human data.
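To illustrate the plan abstraction the abstract refers to, the sketch below contrasts a sequence of low-level environment actions with a shorter, more interpretable plan, and shows the kind of prompt that might be used to elicit such a plan from a large language model. The action names, the PlanStep structure, and the prompt wording are hypothetical illustrations, not the paper's actual task definition or prompting setup.

```python
# Hypothetical illustration of a "plan" vs. low-level environment actions.
# Action vocabulary, dataclass, and prompt template are assumptions made
# for illustration; they are not the paper's actual interface.
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    action: str   # high-level action, e.g. "Pickup"
    target: str   # object or location the action applies to


# Low-level environment actions an embodied agent might execute.
low_level_actions = [
    "Forward", "Forward", "Turn Left", "Forward", "Pickup mug",
    "Turn Right", "Forward", "Place mug sink", "Toggle faucet on",
]

# The same behavior expressed as a plan: fewer, more interpretable steps.
plan: List[PlanStep] = [
    PlanStep("Pickup", "mug"),
    PlanStep("Place", "sink"),
    PlanStep("ToggleOn", "faucet"),
]


def plan_prompt(instruction: str) -> str:
    """Build a prompt that could be sent to an LLM to elicit a plan
    (hypothetical wording; the paper's prompts may differ)."""
    return (
        "You control a household robot. Given the instruction below, "
        "list the high-level steps, one '<action> <object>' per line, "
        "needed to complete it.\n"
        f"Instruction: {instruction}\nPlan:"
    )


if __name__ == "__main__":
    print(plan_prompt("Please rinse the mug in the sink."))
    for i, step in enumerate(plan, 1):
        print(f"{i}. {step.action} {step.target}")
```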
Citation
Padmakumar, A., Inan, M., Gella, S., Lange, P. L., & Hakkani-Tur, D. (2023). Multimodal Embodied Plan Prediction Augmented with Synthetic Embodied Dialogue. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 6114–6131). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.374