Voicefox: Leveraging Inbuilt Transcription to Enhance the Security of Machine-Human Speaker Verification against Voice Synthesis Attacks

Abstract

In this paper, we propose Voicefox, a defense against automated voice synthesis attacks in machine-based and human-based speaker verification applications. Voicefox builds on a previously unexplored capability of speech-to-text transcription, which is already built into these applications. It relies on the premise that, although synthesized samples may be falsely accepted by speaker verification systems and human listeners, transcribers cannot transcribe them as accurately as natural human speech. Voicefox is not a speaker verification system; rather, it is an independent module that can be integrated with any speaker verification system to enhance its security against voice synthesis attacks. To test this premise, and as an essential prerequisite for building Voicefox, we ran an extensive study measuring the accuracy of off-the-shelf speech-to-text techniques on synthesized samples generated by state-of-the-art speech synthesis techniques. Our results show that the transcription error rate for synthesized voices is significantly higher, on average 2-3x, than the error rate for natural voices. This study quantitatively supports our hypothesis that human voices are transcribed more accurately than synthesized voices. We further propose several post-transcription rules for the design of Voicefox, including accepting a transcript even if up to a certain number of words are transcribed incorrectly, and ignoring words that are not in the reference dictionary. Using these rules, Voicefox can reduce the false rejection rate to as low as 1.20-4.69%, depending on the application and the transcriber used, and reduce the false acceptance rate to 0% for dictionaries with phonetically distinct words.
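The two post-transcription rules named in the abstract (tolerating a bounded number of word errors, and ignoring transcribed words that are not in the reference dictionary) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `max_word_errors` threshold, the use of word-level edit distance as the error count, and the choice to filter only the transcript are all assumptions made for illustration.

```python
def normalize(text):
    """Lowercase the text and split it into words, trimming edge punctuation."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return [w for w in words if w]


def word_errors(reference, hypothesis):
    """Word-level edit distance: substitutions + insertions + deletions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def accept_sample(expected_text, transcript, dictionary=None, max_word_errors=1):
    """Hypothetical acceptance check applied after transcription.

    Assumed rule 1: if a reference dictionary is given, transcribed words
    that are not in it are ignored.
    Assumed rule 2: the sample is accepted when at most `max_word_errors`
    words differ from the expected text.
    """
    expected = normalize(expected_text)
    heard = normalize(transcript)
    if dictionary is not None:
        vocab = {w.lower() for w in dictionary}
        heard = [w for w in heard if w in vocab]
    return word_errors(expected, heard) <= max_word_errors
```

Under these assumptions, a call such as `accept_sample("alpha bravo charlie delta", "um alpha bravo charlie delta", dictionary={"alpha", "bravo", "charlie", "delta"}, max_word_errors=0)` accepts the sample, because the out-of-dictionary word "um" is dropped before the comparison. A looser threshold lowers false rejections of natural speech but gives a synthesized sample more room to pass, which is the trade-off the reported error rates reflect.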

Cite

Shirvanian, M., Mohammed, M., Saxena, N., & Anand, S. A. (2020). Voicefox: Leveraging Inbuilt Transcription to Enhance the Security of Machine-Human Speaker Verification against Voice Synthesis Attacks. In ACM International Conference Proceeding Series (pp. 870–883). Association for Computing Machinery. https://doi.org/10.1145/3427228.3427289
