A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild

Abstract

In this work, we investigate the problem of lip-syncing a talking-face video of an arbitrary identity to match a target speech segment. Current methods excel at producing accurate lip movements on a static image or on videos of the specific people seen during training. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking-face videos, leaving significant parts of the video out of sync with the new audio. We identify the key reasons for this failure and resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics that accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluation on these challenging benchmarks shows that the lip-sync accuracy of videos generated by our Wav2Lip model is almost as good as that of real, synchronized videos. We provide a demo video that clearly shows the substantial impact of our Wav2Lip model, and we publicly release the code, models, and evaluation benchmarks on our website.
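The "powerful lip-sync discriminator" referenced in the abstract is a pre-trained, SyncNet-style expert whose judgment is used as a training signal for the generator. Below is a minimal sketch of that idea, assuming separate face and audio encoders whose ReLU-activated embeddings are compared with cosine similarity and penalized via a log loss. All names here (TinyEncoder, sync_probability, expert_sync_loss) and the toy dimensions are illustrative stand-ins, not the actual Wav2Lip identifiers or architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for either expert encoder; the real ones are deep CNNs."""
    def __init__(self, in_dim, emb_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, emb_dim), nn.ReLU(),
            # Final ReLU keeps embeddings non-negative, so the cosine
            # similarity below lies in [0, 1] and can act as a probability.
            nn.Linear(emb_dim, emb_dim), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

def sync_probability(face_emb, audio_emb, eps=1e-7):
    # Cosine similarity between the face-window and speech embeddings,
    # interpreted as P(audio and video are in sync); eps guards log(0).
    return F.cosine_similarity(face_emb, audio_emb, dim=1).clamp(min=eps)

def expert_sync_loss(face_emb, audio_emb):
    # Generator penalty: -log P_sync, averaged over the batch. The expert's
    # weights stay frozen; the gradient flows to the generator via face_emb.
    return -torch.log(sync_probability(face_emb, audio_emb)).mean()

# Toy usage with random stand-ins: a 5-frame window of 48x96 lower-face
# crops and the corresponding 16-step, 80-bin mel-spectrogram segment.
face_enc, audio_enc = TinyEncoder(5 * 48 * 96), TinyEncoder(80 * 16)
faces = torch.rand(8, 5 * 48 * 96)  # batch of flattened face windows
mels = torch.rand(8, 80 * 16)       # batch of flattened mel segments
loss = expert_sync_loss(face_enc(faces), audio_enc(mels))
print(f"sync loss: {loss.item():.4f}")
```

In training, minimizing this loss pushes the generator to produce mouth shapes that the frozen expert rates as in sync with the target speech, which is the mechanism the abstract credits for accurate lip-sync on arbitrary identities.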

Citation (APA)

Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (2020). A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 484–492). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3413532
