You Said That?: Synthesising Talking Faces from Audio



Abstract

We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we develop an encoder–decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.
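Below is a minimal sketch, assuming PyTorch, of the kind of encoder–decoder generator the abstract describes: separate CNN encoders for the still face image and a short audio window are concatenated into a joint embedding, and a decoder maps that embedding to a synthesised face frame. All module names, layer widths, and input shapes (for example the 112×112 face crop and the 13×35 MFCC window) are illustrative assumptions, not the architecture reported in the paper.

```python
# Illustrative sketch only (not the authors' released code): a joint
# face/audio embedding feeding an image decoder, as outlined in the abstract.
import torch
import torch.nn as nn

class TalkingFaceGenerator(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Identity encoder: still face image -> identity embedding
        self.face_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, embed_dim),
        )
        # Audio encoder: short MFCC/spectrogram window -> audio embedding
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Decoder: concatenated joint embedding -> synthesised face frame
        self.decoder = nn.Sequential(
            nn.Linear(2 * embed_dim, 256 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (256, 4, 4)),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, face_img, audio_window):
        # Joint embedding: concatenate identity and audio codes, then decode
        joint = torch.cat([self.face_enc(face_img),
                           self.audio_enc(audio_window)], dim=1)
        return self.decoder(joint)

# Example usage with assumed input sizes: one face crop and one audio window
frame = TalkingFaceGenerator()(torch.randn(1, 3, 112, 112),
                               torch.randn(1, 1, 13, 35))
print(frame.shape)  # torch.Size([1, 3, 32, 32]) in this illustrative sketch
```

In the paper the model is applied frame by frame, producing one lip-synched output frame per audio window; the sketch above shows only the joint-embedding idea, not the self-supervised training or the multi-stream blending model.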

Citation (APA)

Jamaludin, A., Chung, J. S., & Zisserman, A. (2019). You Said That?: Synthesising Talking Faces from Audio. International Journal of Computer Vision, 127(11–12), 1767–1779. https://doi.org/10.1007/s11263-019-01150-y
