Language meets YOLOv8 for metric monocular SLAM

Abstract

We present a new approach that combines spoken language and visual object detection to produce a depth image, enabling metric monocular SLAM in real time without a depth or stereo camera. A compact matrix representation of the language and objects, together with a partitioning algorithm, resolves the association between the objects mentioned in the spoken description and the objects visually detected in the image. The spoken language is processed online using Whisper, a popular automatic speech recognition system, while the YOLOv8 network is used for object detection. Camera pose estimation and mapping of the scene are performed using ORB-SLAM. The full system runs in real time, allowing a user to explore the scene with a handheld camera, observe the objects detected by YOLOv8, and provide the depth of these objects with respect to the camera via a spoken description. We performed experiments in indoor and outdoor scenarios, comparing the camera trajectory and map obtained with our approach against those obtained when using RGB-D images. Our results are comparable to the RGB-D results, without losing real-time performance.
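
The association step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matrix representation or the partitioning algorithm, so the code below swaps in a simple greedy rule as an assumption: each spoken mention is paired with the leftmost unused detection that has the same label, and the spoken depth is painted into that detection's bounding box. The names `Detection`, `Mention`, `associate`, and `sparse_depth_image` are hypothetical and do not come from the paper, Whisper, YOLOv8, or ORB-SLAM.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str   # class name reported by the object detector
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels, max exclusive


@dataclass
class Mention:
    label: str      # object name parsed from the speech transcript
    depth_m: float  # distance to the camera stated by the speaker, in metres


def associate(mentions, detections):
    """Pair each mention with an unused detection of the same label.

    Assumption: a greedy left-to-right rule stands in for the paper's
    matrix representation and partitioning algorithm.
    """
    # Indices of detections, ordered by the left edge of their boxes.
    unused = sorted(range(len(detections)), key=lambda i: detections[i].box[0])
    pairs = []
    for mention in mentions:
        for i in unused:
            if detections[i].label == mention.label:
                pairs.append((mention, detections[i]))
                unused.remove(i)  # safe: we leave the inner loop right away
                break
    return pairs


def sparse_depth_image(pairs, width, height):
    """Fill each matched box with the spoken depth; 0.0 marks unknown depth."""
    depth = [[0.0] * width for _ in range(height)]
    for mention, det in pairs:
        x0, y0, x1, y1 = det.box
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                depth[y][x] = mention.depth_m
    return depth


# Toy example: two chairs and a TV detected in an 8x4 image.
detections = [
    Detection("chair", (1, 1, 3, 3)),
    Detection("chair", (5, 1, 7, 3)),
    Detection("tv", (3, 0, 5, 1)),
]
# Spoken description: "a chair at 1.5 metres, a TV at 3 metres, a chair at 2 metres".
mentions = [Mention("chair", 1.5), Mention("tv", 3.0), Mention("chair", 2.0)]

pairs = associate(mentions, detections)
depth = sparse_depth_image(pairs, width=8, height=4)
```

In this toy setup the two chairs are disambiguated only by image position, which is the kind of ambiguity the paper's association method is meant to resolve. A real pipeline would presumably need a richer rule and a policy for overlapping boxes; here a later mention simply overwrites an earlier one.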

Cite

Martinez-Carranza, J., Hernández-Farías, D. I., Rojas-Perez, L. O., & Cabrera-Ponce, A. A. (2023). Language meets YOLOv8 for metric monocular SLAM. Journal of Real-Time Image Processing, 20(4). https://doi.org/10.1007/s11554-023-01318-3
