Abstract
Embodied task completion is a challenge in which an agent in a simulated environment must predict environment actions to complete tasks based on natural language instructions and egocentric visual observations. We propose a variant of this problem in which the agent predicts actions at a higher level of abstraction, called a plan, which makes agent behavior more interpretable and can be obtained by appropriately prompting large language models. We show that multimodal transformer models outperform language-only models on this problem but still fall significantly short of performance with oracle plans. Since collecting human-human dialogues for embodied environments is expensive and time-consuming, we propose a method to synthetically generate such dialogues, which we then use as training data for plan prediction. We demonstrate that multimodal transformer models attain strong zero-shot performance from our synthetic data, outperforming language-only models trained on human-human data.
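To illustrate the plan abstraction the abstract refers to, the sketch below contrasts a sequence of low-level environment actions with a shorter, more interpretable plan, and shows the kind of prompt that might be used to elicit such a plan from a large language model. The action names, the PlanStep structure, and the prompt wording are hypothetical illustrations, not the paper's actual task definition or prompting setup.

```python
# Hypothetical illustration of a "plan" vs. low-level environment actions.
# Action vocabulary, dataclass, and prompt template are assumptions made
# for illustration; they are not the paper's actual interface.
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    action: str   # high-level action, e.g. "Pickup"
    target: str   # object or location the action applies to


# Low-level environment actions an embodied agent might execute.
low_level_actions = [
    "Forward", "Forward", "Turn Left", "Forward", "Pickup mug",
    "Turn Right", "Forward", "Place mug sink", "Toggle faucet on",
]

# The same behavior expressed as a plan: fewer, more interpretable steps.
plan: List[PlanStep] = [
    PlanStep("Pickup", "mug"),
    PlanStep("Place", "sink"),
    PlanStep("ToggleOn", "faucet"),
]


def plan_prompt(instruction: str) -> str:
    """Build a prompt that could be sent to an LLM to elicit a plan
    (hypothetical wording; the paper's prompts may differ)."""
    return (
        "You control a household robot. Given the instruction below, "
        "list the high-level steps, one '<action> <object>' per line, "
        "needed to complete it.\n"
        f"Instruction: {instruction}\nPlan:"
    )


if __name__ == "__main__":
    print(plan_prompt("Please rinse the mug in the sink."))
    for i, step in enumerate(plan, 1):
        print(f"{i}. {step.action} {step.target}")
```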
Citation
Padmakumar, A., Inan, M., Gella, S., Lange, P. L., & Hakkani-Tur, D. (2023). Multimodal Embodied Plan Prediction Augmented with Synthetic Embodied Dialogue. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 6114–6131). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.374