Synthetic Speech Detection through Audio Folding

Abstract

Recent advances in deep learning and speech synthesis have made it possible to create highly realistic fake speech tracks that are difficult to distinguish from real ones. Since malicious use of such data can have dangerous consequences, the audio forensics community has focused on developing synthetic speech detectors that determine the authenticity of speech tracks. In this work we focus on the wide class of detectors that analyze audio streams on a frame-by-frame basis. We propose a technique that reduces the inference time of these detectors by exploiting the fact that multiple audio frames can be mixed into a single one (i.e., in the same way a mono track is obtained from a stereo one). We test the proposed audio folding technique on speech tracks obtained from the ASVspoof 2019 dataset. The technique proves effective on both entirely and partially fake speech tracks and shows remarkable results, reducing processing time to as little as 25% of the original.
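The folding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fold_audio`, the parameters `frame_len` and `fold_factor`, the zero-padding of the last group, and the choice of averaging (rather than summing) the frames are all assumptions made for illustration, modeled on the usual stereo-to-mono downmix.

```python
import numpy as np


def fold_audio(signal, frame_len, fold_factor):
    """Mix groups of `fold_factor` consecutive frames into single frames.

    Hypothetical helper, not taken from the paper. Frames are averaged
    sample by sample, as in a simple stereo-to-mono downmix.
    """
    signal = np.asarray(signal, dtype=np.float64)
    group_len = frame_len * fold_factor

    # Assumption: zero-pad the tail so the signal splits into whole groups.
    pad = (-len(signal)) % group_len
    padded = np.pad(signal, (0, pad))

    # Shape: (number of groups, frames per group, samples per frame).
    frames = padded.reshape(-1, fold_factor, frame_len)

    # Average the frames of each group into one folded frame.
    return frames.mean(axis=1)


if __name__ == "__main__":
    # Toy signal: 8 samples, frames of 2 samples, 2 frames folded together.
    folded = fold_audio(np.arange(8), frame_len=2, fold_factor=2)
    print(folded.shape)  # 2 folded frames instead of 4 original ones
```

Under these assumptions, a frame-by-frame detector would process one folded frame in place of every `fold_factor` original frames, so a fold factor of 4 would leave a quarter as many frames to analyze.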

Cite

Salvi, D., Bestagini, P., & Tubaro, S. (2023). Synthetic Speech Detection through Audio Folding. In ACM International Conference Proceeding Series (pp. 3–9). Association for Computing Machinery. https://doi.org/10.1145/3592572.3592844
