Learning Visual Localization of a Quadrotor Using Its Noise as Self-Supervision

10 citations · 16 Mendeley readers

This article is free to access.
Abstract

We introduce an approach to train neural network models for visual object localization using a small training set labeled with ground-truth object positions and a large unlabeled one. We assume that the object to be localized emits sound, which is perceived by a microphone rigidly affixed to the camera. This information is used as the target of a cross-modal pretext task: predicting sound features from camera frames. By solving the pretext task, the model draws self-supervision from visual and audio data. The approach is well suited to robot learning: we instantiate it to localize a small quadrotor from 128 × 80 pixel images acquired by a ground robot. Experiments on a separate testing set show that introducing the auxiliary pretext task yields large performance improvements: the Mean Absolute Error (MAE) of the estimated image coordinates of the target is reduced from 7 to 4 pixels, and the MAE of the estimated distance is reduced from 28 cm to 14 cm. A model that has access to labels for the entire training set yields MAEs of 2 pixels and 11 cm, respectively.
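To make the training scheme described above concrete, the sketch below shows one possible way to combine a supervised localization loss on the small labeled set with the self-supervised audio-prediction pretext loss on the large unlabeled set. This is not the authors' implementation; the network shape, the audio-feature dimensionality, and the loss weighting are illustrative assumptions.

# Minimal sketch (assumed, not the paper's code): a shared encoder with two heads,
# one predicting image coordinates and distance of the quadrotor (supervised),
# one predicting sound features from the same frame (self-supervised pretext task).
import torch
import torch.nn as nn

class LocalizerWithPretext(nn.Module):
    def __init__(self, audio_feat_dim=32):  # audio_feat_dim is an assumption
        super().__init__()
        # Shared convolutional encoder for 128 x 80 RGB frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Main head: image coordinates (u, v) and distance to the target.
        self.loc_head = nn.Linear(64, 3)
        # Pretext head: predict audio features extracted from the microphone signal.
        self.audio_head = nn.Linear(64, audio_feat_dim)

    def forward(self, frames):
        z = self.encoder(frames)
        return self.loc_head(z), self.audio_head(z)

def training_step(model, labeled_batch, unlabeled_batch, alpha=1.0):
    """One combined step: localization loss on the labeled batch plus the
    pretext loss on the unlabeled batch (alpha weights the pretext term)."""
    frames_l, loc_targets = labeled_batch      # loc_targets: (u, v, distance)
    frames_u, audio_feats = unlabeled_batch    # audio features act as free labels
    pred_loc, _ = model(frames_l)
    _, pred_audio = model(frames_u)
    loss_loc = nn.functional.l1_loss(pred_loc, loc_targets)
    loss_pretext = nn.functional.mse_loss(pred_audio, audio_feats)
    return loss_loc + alpha * loss_pretext

Under this setup the localization head only ever sees the small labeled set, while the shared encoder additionally learns from the much larger unlabeled set through the audio-prediction pretext task, which is the source of the reported error reductions.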

Citation (APA)
Nava, M., Paolillo, A., Guzzi, J., Gambardella, L. M., & Giusti, A. (2022). Learning Visual Localization of a Quadrotor Using Its Noise as Self-Supervision. IEEE Robotics and Automation Letters, 7(2), 2218–2225. https://doi.org/10.1109/LRA.2022.3143565
