Human-Machine Communication by Audio-Visual Integration

  • Nakamura S
  • Yotsukura T
  • Morishima S

Abstract

The use of audio-visual information is integral to human communication. The complementary use of audio and visual information enables more accurate, robust, natural, and friendly communication in real environments. Computers likewise require both kinds of information to realize natural and friendly interfaces, which at present are unreliable and unfriendly. In this chapter, we focus on synchronous multi-modalities, specifically the audio information of speech and the image information of the face, for audio-visual speech recognition, synthesis, and translation. Audio speech and visual speech both originate from movements of the speech organs triggered by motor commands from the brain, so the two signals represent the same utterance in different ways; the audio and visual speech modalities therefore have strong correlations and complementary relationships. There is a strong demand to improve current speech recognition performance, which degrades drastically in real environments when speech is exposed to acoustic noise, reverberation, and differences in speaking style. The integration of audio and visual information is expected to make recognition systems robust and reliable and to improve their performance. There is also a demand to improve the intelligibility of speech synthesis: multi-modal synthesis of audio speech together with lip-synchronized talking-face images can improve both intelligibility and naturalness. Section 1 describes audio-visual speech detection and recognition, which aim to improve the robustness of speech recognition in noisy real-world environments. Section 2 introduces a talking-face synthesis system based on a 3-D mesh model and an audio-visual speech translation system, which recognizes input speech in the source language, translates it into the target language, and synthesizes output speech in the target language.
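The core idea of integrating the two modalities can be illustrated with a minimal feature-fusion sketch. This is a generic frame-level concatenation of weighted audio and visual feature streams, not the chapter's actual method; the feature dimensions (13 acoustic coefficients, 6 lip-shape parameters) and the stream weight are illustrative assumptions.

```python
# Hypothetical sketch of frame-level audio-visual feature fusion.
# Dimensions and the stream weight are illustrative, not from the chapter.
import numpy as np

def fuse_features(audio_feats, visual_feats, audio_weight=0.7):
    """Concatenate per-frame audio and visual features after weighting
    each stream; assumes both streams are already time-aligned to the
    same frame rate."""
    assert audio_feats.shape[0] == visual_feats.shape[0]
    w_a, w_v = audio_weight, 1.0 - audio_weight
    return np.hstack([w_a * audio_feats, w_v * visual_feats])

# 100 frames of 13-dim acoustic and 6-dim lip-shape features
audio = np.random.randn(100, 13)
visual = np.random.randn(100, 6)
fused = fuse_features(audio, visual)
print(fused.shape)  # (100, 19)
```

The stream weight lets a recognizer lean more on the visual channel when acoustic noise is high, which is the complementary behavior the chapter argues for.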
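The recognize-translate-synthesize chain described above can be sketched as a simple pipeline. The function names, data structure, and toy translation table below are placeholders for illustration, not components of the authors' system.

```python
# Hypothetical sketch of the recognize -> translate -> synthesize chain.
# All names and the toy lookup table are placeholders, not the chapter's
# actual system components.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    language: str

def recognize(audio_visual_input, source_lang):
    # A real system would run audio-visual speech recognition here.
    return Utterance(text=audio_visual_input, language=source_lang)

def translate(utt, target_lang):
    # Placeholder for machine translation.
    toy_table = {("konnichiwa", "en"): "hello"}
    text = toy_table.get((utt.text, target_lang), utt.text)
    return Utterance(text=text, language=target_lang)

def synthesize(utt):
    # A real system would generate audio plus a lip-synchronized face.
    return f"[speech+face in {utt.language}]: {utt.text}"

out = synthesize(translate(recognize("konnichiwa", "ja"), "en"))
print(out)  # [speech+face in en]: hello
```

Each stage only depends on the previous stage's output, which is why the chapter can present recognition, translation, and talking-face synthesis as separable components.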

Citation (APA)

Nakamura, S., Yotsukura, T., & Morishima, S. (2005). Human-Machine Communication by Audio-Visual Integration (pp. 349–368). https://doi.org/10.1007/3-540-32367-8_16
