Language meets YOLOv8 for metric monocular SLAM

Jose Martinez-Carranza; Delia Irazú Hernández-Farías; L. Oyuki Rojas-Perez; Aldrich A. Cabrera-Ponce

Journal Article

Journal of Real-Time Image Processing (2023) 20(4)

DOI: 10.1007/s11554-023-01318-3

12 Citations

12 Readers

Abstract

We present a new approach that combines spoken language and visual object detection to produce a depth image, enabling metric monocular SLAM in real time without a depth or stereo camera. We propose a methodology in which a compact matrix representation of the language and objects, together with a partitioning algorithm, resolves the association between the objects mentioned in the spoken description and the objects visually detected in the image. The spoken language is processed online using Whisper, a popular automatic speech recognition system, while the YOLOv8 network is used for object detection. Camera pose estimation and mapping of the scene are performed using ORB-SLAM. The full system runs in real time, allowing a user to explore the scene with a handheld camera, observe the objects detected by YOLOv8, and provide the depth of these objects with respect to the camera via a spoken description. We performed experiments in indoor and outdoor scenarios, comparing the camera trajectory and map obtained with our approach against those obtained when using RGB-D images. Our results are comparable to the RGB-D results, and the system retains real-time performance.
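The idea of turning a spoken description plus object detections into a depth image can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify the matrix representation or the partitioning algorithm, so this sketch swaps in plain label matching. The `Detection` class, the `parse_depths` and `build_depth_image` functions, the regular expression, and the convention that 0.0 marks unknown depth are all hypothetical choices made for illustration. The transcript and the bounding boxes are hard-coded stand-ins for what Whisper and YOLOv8 would provide.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one detector output."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels, max exclusive


# Assumed phrase shape: "<object> is at <number> meters".
PATTERN = re.compile(
    r"(\w+)\s+(?:is\s+)?(?:at\s+)?(\d+(?:\.\d+)?)\s*(?:meters?|metres?|m)\b"
)


def parse_depths(transcript):
    """Extract {object label: depth in meters} from a transcript."""
    pairs = PATTERN.findall(transcript.lower())
    return {label: float(value) for label, value in pairs}


def build_depth_image(width, height, detections, depths):
    """Fill each detected box with the spoken depth of its label.

    Association here is plain label equality (an assumption), and
    pixels without a spoken depth keep the value 0.0.
    """
    image = [[0.0] * width for _ in range(height)]
    for det in detections:
        if det.label not in depths:
            continue  # detected but never mentioned by the user
        x0, y0, x1, y1 = det.box
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                image[y][x] = depths[det.label]
    return image


transcript = "The chair is at 2.5 meters and the bottle is 0.8 meters away."
detections = [
    Detection("chair", (1, 1, 4, 3)),
    Detection("bottle", (5, 0, 7, 2)),
    Detection("person", (0, 0, 1, 1)),
]
depths = parse_depths(transcript)
depth_image = build_depth_image(8, 4, detections, depths)
```

Label equality breaks down when several objects of the same class are visible, which is presumably the ambiguity the association step described in the abstract is meant to resolve.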

Citation (APA)

Martinez-Carranza, J., Hernández-Farías, D. I., Rojas-Perez, L. O., & Cabrera-Ponce, A. A. (2023). Language meets YOLOv8 for metric monocular SLAM. Journal of Real-Time Image Processing, 20(4). https://doi.org/10.1007/s11554-023-01318-3
