Vietnamese Speaker Verification With Mel-Scale Filter Bank Energies and Deep Learning

This article is free to access.

Abstract

Mel-Frequency Cepstral Coefficients (MFCCs) have been extensively used as input for many traditional and modern speech processing systems. The power of MFCCs lies in their compact representation of speech signals, which captures the essential phonetic content of the speech. However, most of the MFCC energy is concentrated in the low-order coefficients, and the flat distribution of high-order MFCC values makes convolutional operators less sensitive to the transient details of the coefficients, which may be important in certain speech processing tasks such as speaker recognition. In this paper, we analyze the differences between Mel-scale filter bank energies (MFBEs) and MFCCs, and we show that MFBEs are more effective inputs for deep learning-based Vietnamese speaker verification. MFBEs help deep learning models learn a better speaker representation with a more compact distribution of embedding vectors. Experiments on two Vietnamese speaker verification datasets show that MFBEs consistently outperform MFCCs as inputs to some state-of-the-art deep learning models. With MFBEs, the equal error rate (EER) was reduced by 1.14% on the Vietnam-Celeb test dataset with the ResNetSE-34 model, and by 2.36% (a 51.6% improvement) on the VLSP2021 test dataset with the ECAPA-TDNN model and transfer learning.
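
The relationship between the two feature types can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the 16 kHz sampling rate, 512-point FFT, 10 ms hop, 40 Mel filters, the cut-off of 13 "low-order" coefficients, and all function names are assumptions chosen for illustration, and the input is synthetic noise rather than speech. The sketch computes log MFBEs, derives MFCCs from them with an orthonormal DCT-II, and measures how much of the MFCC energy falls in the low-order coefficients.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mfbe(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Log Mel-scale filter bank energies, shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    energies = power @ mel_filter_bank(n_mels, n_fft, sr).T
    return np.log(energies + 1e-10)  # small floor avoids log(0)

def dct2_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    mat[0] /= np.sqrt(2.0)
    return mat

def mfcc_from_mfbe(mfbe):
    """MFCCs as the DCT-II of the log MFBEs along the Mel axis (all coefficients kept)."""
    return mfbe @ dct2_matrix(mfbe.shape[1]).T

# Synthetic one-second input (assumption: white noise stands in for speech).
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)

mfbe = log_mfbe(signal)
mfcc = mfcc_from_mfbe(mfbe)

# Share of total squared MFCC magnitude carried by the first 13 coefficients.
low_share = np.sum(mfcc[:, :13] ** 2) / np.sum(mfcc ** 2)
print(mfbe.shape, mfcc.shape, round(float(low_share), 3))
```

Because the DCT used here is orthonormal and no coefficients are dropped, the MFCCs are only a rotation of the log MFBEs and carry the same information. The practical difference is in how that information is laid out along the feature axis: MFBEs keep neighbouring Mel bands adjacent, while the DCT packs most of the magnitude into the first few coefficients.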

Cite

Nguyen, T. T. M., Nguyen, D. D., & Luong, C. M. (2024). Vietnamese Speaker Verification With Mel-Scale Filter Bank Energies and Deep Learning. IEEE Access, 12, 150114–150122. https://doi.org/10.1109/ACCESS.2024.3479092
