Isolated single sound lip-reading using a frame-based camera and event-based camera

Abstract

Unlike a conventional frame-based camera, an event-based camera detects changes in brightness at each pixel over time. This work addresses lip-reading as a new application of the event-based camera. This paper proposes event camera-based lip-reading for isolated single-sound recognition. The proposed method consists of generating images from event data, detecting the face and facial feature points, and recognizing the utterance with a Temporal Convolutional Network (TCN). Furthermore, this paper proposes a method that combines the two modalities of the frame-based and event-based cameras. To evaluate the proposed method, utterance scenes of 15 Japanese consonants from 20 speakers were collected with an event-based camera and a video camera, and an original dataset was constructed. Several experiments were conducted in which images were generated from the event data at multiple frame rates. The highest recognition accuracy was obtained with the event-based camera images at 60 fps. Moreover, combining the two modalities yielded higher recognition accuracy than either single modality.
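
The abstract describes the pipeline only at a high level, and the paper's code is not reproduced here. The following is a minimal Python sketch, under stated assumptions, of the two core steps: accumulating an event stream into fixed-rate frames (e.g., at 60 fps, the best-performing rate reported above) and classifying a per-frame feature sequence with a small dilated TCN. The event tuple layout (t, x, y, polarity), the 128-dimensional mouth-region feature, and all network sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch only; layouts and sizes are assumptions, not the paper's code.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def events_to_frames(events, width, height, duration_s, fps=60):
    """Accumulate an event stream into fixed-rate frames.

    events: (N, 4) array with columns (t in seconds, x, y, polarity in {-1, +1}).
    Returns frames of shape (num_frames, height, width).
    """
    num_frames = int(np.ceil(duration_s * fps))
    frames = np.zeros((num_frames, height, width), dtype=np.float32)
    idx = np.minimum((events[:, 0] * fps).astype(int), num_frames - 1)
    # Sum signed polarities into the frame each event falls into.
    np.add.at(frames,
              (idx, events[:, 2].astype(int), events[:, 1].astype(int)),
              events[:, 3])
    return frames

class TCNBlock(nn.Module):
    """Dilated causal 1-D convolution with a residual connection."""
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad only, for causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, time)
        y = self.conv(F.pad(x, (self.pad, 0)))
        return F.relu(y + x)                  # residual keeps time length fixed

class LipReadingTCN(nn.Module):
    """Per-frame mouth-region features -> TCN -> 15-way consonant logits."""
    def __init__(self, feat_dim=128, channels=64, num_classes=15):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.tcn = nn.Sequential(*[TCNBlock(channels, d) for d in (1, 2, 4, 8)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        h = self.tcn(self.proj(feats.transpose(1, 2)))
        return self.head(h.mean(dim=2))       # temporal average pooling

# Example: 8 clips of 30 frames (0.5 s at 60 fps) with 128-D features per frame.
logits = LipReadingTCN()(torch.randn(8, 30, 128))  # -> (8, 15)
```

Stacking dilations 1, 2, 4, and 8 with kernel size 3 gives a receptive field of 31 frames, enough to cover a short single-sound utterance at 60 fps; a simple fusion of the two modalities could, for instance, average or concatenate the per-modality logits before the softmax.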

Cite

APA
Kanamaru, T., Arakane, T., & Saitoh, T. (2023). Isolated single sound lip-reading using a frame-based camera and event-based camera. Frontiers in Artificial Intelligence, 5. https://doi.org/10.3389/frai.2022.1070964
