In this work, we present a new AI task, Vision to Action (V2A), in which an agent (a robotic arm) is asked to perform a high-level task (e.g., stacking) involving objects present in a scene. The agent must propose a plan composed of primitive actions (e.g., simple movements, grasping) that successfully completes the given task. Queries are formulated so that the agent must perform visual reasoning over the presented scene before inferring the actions. We propose a novel approach based on multimodal attention for this task and demonstrate its performance on our new V2A dataset. We also describe a method for building the dataset: we generate task instructions for each scene and design an engine capable of assessing whether a sequence of primitives leads to successful task completion.
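The abstract describes plans as sequences of primitive actions whose success is judged by an engine. The following is a minimal, hypothetical sketch of that setup for a toy stacking task; the primitive names (`move_to`, `grasp`, `release`), the world-state representation, and the success check are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of the V2A setup: a plan is a sequence of
# primitive actions, and a toy "engine" checks whether executing
# them completes a "stack A on B" task. All names are illustrative.

def execute(plan, state):
    """Apply primitives (move_to, grasp, release) to a toy world state."""
    for action, arg in plan:
        if action == "move_to":
            state["arm_at"] = arg
        elif action == "grasp" and state["arm_at"] == arg and state["holding"] is None:
            state["holding"] = arg
        elif action == "release" and state["holding"] is not None:
            # Releasing places the held object at the arm's current target.
            state["on"][state["holding"]] = state["arm_at"]
            state["holding"] = None
    return state

def task_completed(state, top, bottom):
    """Engine-style success check: is `top` stacked on `bottom`?"""
    return state["on"].get(top) == bottom

state = {"arm_at": None, "holding": None, "on": {}}
plan = [("move_to", "red_cube"), ("grasp", "red_cube"),
        ("move_to", "blue_cube"), ("release", None)]
state = execute(plan, state)
print(task_completed(state, "red_cube", "blue_cube"))  # True
```

In this simplified form, evaluating a predicted plan reduces to simulating the primitives and querying the resulting state, which mirrors the role the paper assigns to its assessment engine.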
Citation
Nazarczuk, M., & Mikolajczyk, K. (2021). V2A - Vision to Action: Learning Robotic Arm Actions Based on Vision and Language. In Lecture Notes in Computer Science (Vol. 12624, pp. 721–736). Springer. https://doi.org/10.1007/978-3-030-69535-4_44