In this work we present a bimodal multitask network for audiovisual biometric recognition. The proposed network fuses features extracted from face and speech data via a weighted sum that jointly optimizes the contribution of each modality to identify a client. The extracted speech features are simultaneously used in a speech recognition task on random digit sequences. Text-prompted verification is performed by fusing the scores obtained from matching the bimodal embeddings with the Word Error Rate (WER) metric computed from the transcription accuracy. The score fusion outputs a value that is compared with a threshold to accept or reject the identity of a client. Training and evaluation were carried out using our proprietary BIOMEX-DB database and the VidTIMIT audiovisual database. Our network achieved an accuracy of 100% and an Equal Error Rate (EER) of 0.44% for identification and verification, respectively, in the best case. To the best of our knowledge, this is the first system that combines the mutually related tasks described above for biometric recognition.
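The fusion strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights `alpha` and `beta`, the cosine-similarity matching score, and the `1 - WER` speech score are all assumptions chosen to make the weighted-sum idea concrete.

```python
import numpy as np

def fuse_embeddings(face_emb, speech_emb, alpha=0.6):
    """Weighted-sum fusion of L2-normalized face and speech embeddings.
    alpha is a hypothetical weight balancing the two modalities."""
    face_emb = face_emb / np.linalg.norm(face_emb)
    speech_emb = speech_emb / np.linalg.norm(speech_emb)
    return alpha * face_emb + (1.0 - alpha) * speech_emb

def verify(enroll_emb, probe_emb, wer, beta=0.7, threshold=0.5):
    """Fuse an embedding-match score with a WER-based speech score and
    compare against a decision threshold (accept the claim if above)."""
    cos_sim = float(np.dot(enroll_emb, probe_emb)
                    / (np.linalg.norm(enroll_emb) * np.linalg.norm(probe_emb)))
    match_score = (cos_sim + 1.0) / 2.0   # map cosine similarity to [0, 1]
    speech_score = max(0.0, 1.0 - wer)    # lower WER -> higher score
    fused = beta * match_score + (1.0 - beta) * speech_score
    return fused, fused >= threshold
```

In a text-prompted setting, a perfect transcription of the random digit sequence (WER = 0) raises the fused score, while a mismatched or replayed recording degrades both the embedding match and the WER term, pushing the score below the threshold.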
Atenco, J. C., Moreno, J. C., & Ramirez, J. M. (2023). Audiovisual Biometric Network with Deep Feature Fusion for Identification and Text Prompted Verification. Algorithms, 16(2). https://doi.org/10.3390/a16020066