In this work, we present a new AI task, Vision to Action (V2A), in which an agent (a robotic arm) is asked to perform a high-level task (e.g., stacking) involving objects present in a scene. The agent must propose a plan composed of primitive actions (e.g., simple movements, grasping) that successfully completes the given task. Queries are formulated so that the agent must perform visual reasoning over the presented scene before inferring the actions. We propose a novel approach based on multimodal attention for this task and demonstrate its performance on our new V2A dataset. We also describe a method for building the dataset: we generate task instructions for each scene and design an engine capable of assessing whether a sequence of primitives leads to successful task completion.
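The abstract describes plans as sequences of primitive actions whose success is judged by an engine. The following is a minimal, hypothetical sketch of that setup for a toy stacking task; the primitive names (`move_to`, `grasp`, `release`), the world-state representation, and the success check are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of the V2A setup: a plan is a sequence of
# primitive actions, and a toy "engine" checks whether executing
# them completes a "stack A on B" task. All names are illustrative.

def execute(plan, state):
    """Apply primitives (move_to, grasp, release) to a toy world state."""
    for action, arg in plan:
        if action == "move_to":
            state["arm_at"] = arg
        elif action == "grasp" and state["arm_at"] == arg and state["holding"] is None:
            state["holding"] = arg
        elif action == "release" and state["holding"] is not None:
            # Releasing places the held object at the arm's current target.
            state["on"][state["holding"]] = state["arm_at"]
            state["holding"] = None
    return state

def task_completed(state, top, bottom):
    """Engine-style success check: is `top` stacked on `bottom`?"""
    return state["on"].get(top) == bottom

state = {"arm_at": None, "holding": None, "on": {}}
plan = [("move_to", "red_cube"), ("grasp", "red_cube"),
        ("move_to", "blue_cube"), ("release", None)]
state = execute(plan, state)
print(task_completed(state, "red_cube", "blue_cube"))  # True
```

In this simplified form, evaluating a predicted plan reduces to simulating the primitives and querying the resulting state, which mirrors the role the paper assigns to its assessment engine.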
Citation
Nazarczuk, M., & Mikolajczyk, K. (2021). V2A - Vision to Action: Learning Robotic Arm Actions Based on Vision and Language. In Lecture Notes in Computer Science (Vol. 12624, pp. 721–736). Springer. https://doi.org/10.1007/978-3-030-69535-4_44