V2A - Vision to Action: Learning Robotic Arm Actions Based on Vision and Language

Abstract

In this work, we present a new AI task - Vision to Action (V2A) - where an agent (a robotic arm) is asked to perform a high-level task (e.g. stacking) with objects present in a scene. The agent has to suggest a plan consisting of primitive actions (e.g. simple movement, grasping) in order to complete the given task successfully. Queries are formulated in a way that forces the agent to perform visual reasoning over the presented scene before inferring the actions. We propose a novel approach based on multimodal attention for this task and demonstrate its performance on our new V2A dataset. We also propose a method for building the V2A dataset by generating task instructions for each scene and designing an engine capable of assessing whether a sequence of primitives leads to successful task completion.
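The abstract only outlines the setup, so the sketch below is a rough, hypothetical illustration of how a scene, a task instruction, and a plan of primitive actions might be represented and checked for success. All names here (Primitive, Scene, toy_stacking_check) are assumptions made for illustration; they are not the paper's actual dataset format or assessment engine, which evaluates primitive sequences against the scene itself.

```python
# Hypothetical sketch of the V2A setup described in the abstract.
# All names here (Primitive, Scene, toy_stacking_check) are illustrative
# assumptions, not the paper's actual dataset format or assessment engine.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Primitive:
    """One low-level action the arm can execute (e.g. move, grasp, release)."""
    name: str                      # e.g. "move_to", "grasp", "release"
    target: Optional[str] = None   # object the action refers to, if any


@dataclass
class Scene:
    """Objects present in the scene, keyed by identifier, with 2D positions."""
    positions: Dict[str, Tuple[float, float]]


def toy_stacking_check(plan: List[Primitive], top: str, base: str) -> bool:
    """Toy stand-in for an assessment engine: a stacking task counts as
    completed if the plan grasps the top object, moves to the base object,
    and releases, in that order."""
    wanted = [("grasp", top), ("move_to", base), ("release", top)]
    steps = ((p.name, p.target) for p in plan)
    # Require the three key steps to appear as an ordered subsequence of the plan.
    return all(any(step == w for step in steps) for w in wanted)


if __name__ == "__main__":
    scene = Scene(positions={"red_cube": (0.1, 0.2), "blue_cube": (0.4, 0.2)})
    instruction = "Stack the red cube on top of the blue cube."
    plan = [
        Primitive("move_to", "red_cube"),
        Primitive("grasp", "red_cube"),
        Primitive("move_to", "blue_cube"),
        Primitive("release", "red_cube"),
    ]
    print(instruction, "->", toy_stacking_check(plan, "red_cube", "blue_cube"))
```

In the paper's setting, the query itself would require visual reasoning over the scene (e.g. identifying which objects the instruction refers to) before such a plan of primitives can be inferred; the toy check above only illustrates the idea of verifying a primitive sequence against a task.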

Citation (APA)
Nazarczuk, M., & Mikolajczyk, K. (2021). V2A - Vision to Action: Learning Robotic Arm Actions Based on Vision and Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12624 LNCS, pp. 721–736). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-69535-4_44
