Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models

Jingze Lu; Zhuo Li; Yuxiang Zhang; Wenchao Wang; Pengyuan Zhang

Conference Proceedings, Open Access

DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia (2022) 77-84

DOI: 10.1145/3552466.3556524

Abstract

Traditional speech spoofing countermeasures (CM) typically contain a frontend that extracts two-dimensional features from the waveform and a Convolutional Neural Network (CNN) based backend classifier. This pipeline is, to some degree, similar to an image classification task. Pre-training is a widely used paradigm in many fields. Self-supervised pre-trained frontends such as Wav2Vec 2.0 have shown substantial improvements in the speech spoofing detection task. However, these pre-trained models are trained only on bona fide utterances. Moreover, acoustic pre-trained frontends can also be used in text-to-speech (TTS) and voice conversion (VC) tasks, which indicates that they learn commonalities of speech rather than discriminative information between real and fake data. The speech spoofing detection task and the image classification task share the same pipeline. Based on the hypothesis that CNNs follow the same pattern in capturing artefacts in these two tasks, we take the counterintuitive step of applying an image pre-trained CNN model to detect spoofed utterances. To supplement the model with potentially missing acoustic features, we concatenate jitter and shimmer features to the output embedding. Our proposed CM achieves top-level performance on the ASVspoof 2019 dataset.
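The concatenation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which jitter and shimmer variants are used or how they are extracted, so this example assumes the common "local" definitions (mean absolute difference between consecutive pitch periods or peak amplitudes, divided by the mean) and assumes the CNN embedding is already available as a flat list of floats. All function names are hypothetical and are not taken from the paper.

```python
from statistics import mean


def local_jitter(periods):
    """Local jitter (assumed variant): mean absolute difference between
    consecutive pitch periods, divided by the mean period."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(b - a) for a, b in zip(periods[:-1], periods[1:])]
    return mean(diffs) / mean(periods)


def local_shimmer(amplitudes):
    """Local shimmer (assumed variant): mean absolute difference between
    consecutive peak amplitudes, divided by the mean amplitude."""
    if len(amplitudes) < 2:
        raise ValueError("need at least two peak amplitudes")
    diffs = [abs(b - a) for a, b in zip(amplitudes[:-1], amplitudes[1:])]
    return mean(diffs) / mean(amplitudes)


def append_perturbation_features(embedding, periods, amplitudes):
    """Concatenate jitter and shimmer to an utterance-level embedding.

    `embedding` stands in for the output of an image pre-trained CNN;
    how that embedding is produced is outside this sketch.
    """
    return list(embedding) + [local_jitter(periods), local_shimmer(amplitudes)]


if __name__ == "__main__":
    # Toy values for illustration only, not real measurements.
    cnn_embedding = [0.1, 0.2, 0.3]
    pitch_periods = [0.010, 0.011, 0.010, 0.011]  # seconds
    peak_amplitudes = [1.0, 0.9, 1.0, 0.9]
    print(append_perturbation_features(cnn_embedding, pitch_periods, peak_amplitudes))
```

In practice the pitch periods and peak amplitudes would come from a pitch tracker, and the extended vector would be passed to the final classification layer; both of those stages are assumptions here rather than details stated in the abstract.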

Cite

Lu, J., Li, Z., Zhang, Y., Wang, W., & Zhang, P. (2022). Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models. In DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia (pp. 77–84). Association for Computing Machinery, Inc. https://doi.org/10.1145/3552466.3556524
