The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video. We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video. We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state-of-the-art on standard benchmark datasets.
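The two-stream idea can be made concrete with a short sketch. The PyTorch code below is illustrative only: the input shapes (a 13x20 MFCC map for 0.2 s of audio, five 111x111 grayscale mouth frames stacked as channels, 256-d embeddings) follow the paper, but the StreamEncoder module, its layer configuration, and the training snippet are simplified assumptions, not the published network.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamEncoder(nn.Module):
    # Small ConvNet mapping one modality to a fixed-size embedding.
    # Layer sizes are illustrative, not the paper's configuration.
    def __init__(self, in_channels: int, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class TwoStreamSyncNet(nn.Module):
    # One encoder per modality; both map into a shared embedding space.
    def __init__(self, n_frames: int = 5, embed_dim: int = 256):
        super().__init__()
        self.audio_stream = StreamEncoder(in_channels=1, embed_dim=embed_dim)
        self.video_stream = StreamEncoder(in_channels=n_frames, embed_dim=embed_dim)

    def forward(self, mfcc, frames):
        return self.audio_stream(mfcc), self.video_stream(frames)

def contrastive_loss(a, v, same, margin=1.0):
    # Pull in-sync audio/video pairs together; push shifted pairs
    # apart up to the margin. Supervision comes for free: genuine
    # pairs are positives, temporally offset pairs are negatives.
    d = F.pairwise_distance(a, v)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Toy usage with random tensors in place of real clips.
net = TwoStreamSyncNet()
mfcc = torch.randn(8, 1, 13, 20)          # batch of audio feature maps
frames = torch.randn(8, 5, 111, 111)      # batch of stacked mouth frames
same = torch.randint(0, 2, (8,)).float()  # 1 = in-sync pair, 0 = shifted
a, v = net(mfcc, frames)
loss = contrastive_loss(a, v, same)
loss.backward()

At test time, the lip-sync offset can then be estimated by sliding the audio window relative to the video and taking the offset that minimises the embedding distance; a consistently large minimum distance suggests the visible face is not the speaker, which is the basis of the active speaker detection application.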
Chung, J. S., & Zisserman, A. (2017). Out of time: Automated lip sync in the wild. In Lecture Notes in Computer Science (Vol. 10117, pp. 251–263). Springer. https://doi.org/10.1007/978-3-319-54427-4_19