Learnable PINs: Cross-modal embeddings for person identity

10Citations
Citations of this article
191Readers
Mendeley users who have this article in their library.

This article is free to access.

Abstract

We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas.

Cite

CITATION STYLE

APA

Nagrani, A., Albanie, S., & Zisserman, A. (2018). Learnable PINs: Cross-modal embeddings for person identity. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11217 LNCS, pp. 73–89). Springer Verlag. https://doi.org/10.1007/978-3-030-01261-8_5

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free