Synthetic Speech Detection through Audio Folding

Davide Salvi; Paolo Bestagini; Stefano Tubaro

Conference Proceedings, Open Access

ACM International Conference Proceeding Series (2023) 3-9

DOI: 10.1145/3592572.3592844

Abstract

Recent advances in deep learning and speech synthesis have made it possible to create highly realistic fake speech tracks that are difficult to distinguish from real ones. Since malicious use of such tracks can have dangerous consequences, the audio forensics community has focused on developing synthetic speech detectors that determine the authenticity of speech tracks. In this work we focus on the wide class of detectors that analyze audio streams on a frame-by-frame basis. We propose a technique to reduce the inference time of these detectors by exploiting the fact that multiple audio frames can be mixed into a single one (i.e., in the same way a mono track is obtained from a stereo one). We test the proposed audio folding technique on speech tracks obtained from the ASVspoof 2019 dataset. The technique proves effective on both entirely and partially fake speech tracks and shows remarkable results, reducing processing time down to 25%.
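
The folding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `split_frames` and `fold_frames`, the use of non-overlapping frames, the grouping of consecutive frames, the choice of averaging (rather than summing) as the mixing operation, and the dropping of leftover samples are all assumptions made for illustration.

```python
def split_frames(samples, frame_len):
    """Cut a sample sequence into non-overlapping frames of frame_len samples.

    Trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def fold_frames(frames, fold_factor):
    """Mix every group of fold_factor consecutive frames into one frame.

    Mixing is a sample-wise average, analogous to a stereo-to-mono downmix.
    Frames left over after the last full group are dropped (an assumption).
    """
    folded = []
    for start in range(0, len(frames) - fold_factor + 1, fold_factor):
        group = frames[start:start + fold_factor]
        # zip(*group) walks the frames in parallel, one sample position at a time.
        folded.append([sum(column) / fold_factor for column in zip(*group)])
    return folded


# Toy signal: 8 samples, framed 2 samples at a time, then folded 2-to-1.
samples = [float(v) for v in range(8)]
frames = split_frames(samples, frame_len=2)
folded = fold_frames(frames, fold_factor=2)
print(len(frames), len(folded))  # 4 2
```

With a fold factor of 4, a frame-by-frame detector in this sketch would see one quarter as many frames as before, so its frame count (and, roughly, its inference cost) shrinks by the same factor.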

Citation (APA)

Salvi, D., Bestagini, P., & Tubaro, S. (2023). Synthetic Speech Detection through Audio Folding. In ACM International Conference Proceeding Series (pp. 3–9). Association for Computing Machinery. https://doi.org/10.1145/3592572.3592844
