Recent advances in deep learning and speech synthesis have made it possible to create highly realistic fake speech tracks that are difficult to distinguish from real ones. Since malicious use of such data can have dangerous consequences, the audio forensics community has focused on developing synthetic speech detectors that determine the authenticity of speech tracks. In this work we focus on the wide class of detectors that analyze audio streams on a frame-by-frame basis. We propose a technique that reduces the inference time of these detectors by exploiting the fact that multiple audio frames can be mixed into a single one (i.e., in the same way a mono track is obtained from a stereo one). We test the proposed audio folding technique on speech tracks obtained from the ASVspoof 2019 dataset. The technique proves effective on both entirely and partially fake speech tracks and shows remarkable results, cutting processing time down to 25%.
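The folding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fold_audio`, the parameters `frame_len` and `fold_factor`, the use of a plain average as the mixing rule (by analogy with a stereo-to-mono downmix), and the zero-padding of the last group are all assumptions made for illustration.

```python
import numpy as np


def fold_audio(signal, frame_len, fold_factor):
    """Mix every `fold_factor` consecutive frames into a single frame.

    Hypothetical helper: frames are averaged sample by sample, which is
    one plausible way to mix them (assumption, not taken from the paper).
    Returns an array of shape (n_folded_frames, frame_len).
    """
    signal = np.asarray(signal, dtype=np.float64)
    group_len = frame_len * fold_factor

    # Zero-pad so the signal splits evenly into groups of frames (assumption).
    n_groups = int(np.ceil(len(signal) / group_len))
    padded = np.zeros(n_groups * group_len)
    padded[: len(signal)] = signal

    # Shape: (groups, frames per group, samples per frame).
    frames = padded.reshape(n_groups, fold_factor, frame_len)

    # Average the frames inside each group, as in a stereo-to-mono downmix.
    return frames.mean(axis=1)


# Usage: 4 s of audio at 16 kHz, 1 s frames, 4 frames folded into 1.
rng = np.random.default_rng(0)
track = rng.standard_normal(4 * 16000)
folded = fold_audio(track, frame_len=16000, fold_factor=4)
print(folded.shape)  # (1, 16000)
```

With a folding factor of 4, a frame-by-frame detector would receive one frame instead of four. This only illustrates how the number of frames to process shrinks; it says nothing about detection accuracy or about the timings measured in the paper.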
Salvi, D., Bestagini, P., & Tubaro, S. (2023). Synthetic Speech Detection through Audio Folding. In ACM International Conference Proceeding Series (pp. 3–9). Association for Computing Machinery. https://doi.org/10.1145/3592572.3592844