Traditional speech spoofing countermeasures (CMs) typically consist of a frontend that extracts two-dimensional features from the waveform and a Convolutional Neural Network (CNN) based backend classifier. This pipeline resembles, to some degree, an image classification task. Pre-training is a widely used paradigm in many fields, and self-supervised pre-trained frontends such as Wav2Vec 2.0 have shown substantial improvements on the speech spoofing detection task. However, these pre-trained models are trained only on bonafide utterances. Moreover, acoustic pre-trained frontends can also be used in text-to-speech (TTS) and voice conversion (VC) tasks, which suggests that they learn commonalities of speech rather than discriminative information between real and fake data. Since speech spoofing detection and image classification share the same pipeline, and based on the hypothesis that CNNs capture artefacts in the same way in both tasks, we counterintuitively apply an image pre-trained CNN model to detect spoofed utterances. To supplement the model with potentially missing acoustic features, we concatenate Jitter and Shimmer features to the output embedding. Our proposed CM achieves top-level performance on the ASVspoof 2019 dataset.
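The abstract describes concatenating Jitter and Shimmer features to the CNN's output embedding before classification. A minimal sketch of that fusion step is shown below, using standard local-jitter and local-shimmer formulas (mean absolute difference of consecutive pitch periods or peak amplitudes, normalised by the mean); the embedding dimension, the placeholder embedding, and the toy per-cycle measurements are illustrative assumptions, not values from the paper.

```python
import numpy as np

def local_jitter(periods):
    # Local jitter: mean absolute difference between consecutive
    # pitch periods, normalised by the mean period.
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    # Local shimmer: the same statistic applied to peak amplitudes.
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical 512-dim embedding from an image pre-trained CNN backbone
# (a zero vector stands in for the real model output here).
cnn_embedding = np.zeros(512)

# Toy per-cycle measurements; in practice these are extracted from the
# waveform by a pitch tracker.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [0.80, 0.78, 0.82, 0.79]

# Concatenate the two acoustic features to the embedding; the fused
# vector would then be fed to the final classifier head.
features = np.concatenate(
    [cnn_embedding, [local_jitter(periods), local_shimmer(amplitudes)]]
)
print(features.shape)  # (514,)
```

The key point is that the acoustic measurements enter as two extra scalar dimensions, leaving the image pre-trained backbone unchanged.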
Lu, J., Li, Z., Zhang, Y., Wang, W., & Zhang, P. (2022). Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models. In DDAM 2022 - Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia (pp. 77–84). Association for Computing Machinery, Inc. https://doi.org/10.1145/3552466.3556524