Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, the two problems are investigated separately: speaker change detection is performed first to obtain single-speaker regions, and speaker adaptation is then applied to the resulting speaker segments to improve ASR. In an online setting, however, both goals must be achieved in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The speaker embedding is computed using self-attention in an auxiliary network attached to the main ASR network. Adaptation is then performed by subtracting a segment-dependent affine transformation of the learned speaker embedding from the main network's activations. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we evaluate the system on utterances that contain a speaker change point and show that the proposed method performs significantly better than the unadapted main network, with a 10–14% relative reduction in word error rate (WER). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR, with around a 10% relative reduction in WER.
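To make the adaptation mechanism concrete, the following is a minimal PyTorch sketch of the idea the abstract describes: an auxiliary network produces per-frame vectors, self-attention pools them into a segment-level speaker embedding, and an affine transformation of that embedding is subtracted from the main ASR network's hidden activations. All class names, layer choices, and dimensions (`feat_dim`, `hidden_dim`, `embed_dim`) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AuxiliarySpeakerAdapter(nn.Module):
    """Sketch of the abstract's scheme: a self-attentive auxiliary
    network yields a speaker embedding, and the main network's hidden
    activations are adapted by subtracting an affine transform of it.
    Sizes and names are illustrative assumptions."""

    def __init__(self, feat_dim=80, hidden_dim=512, embed_dim=128):
        super().__init__()
        # Auxiliary network: maps acoustic frames to per-frame vectors.
        self.aux = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Self-attention scorer: one scalar score per frame.
        self.attn_score = nn.Linear(embed_dim, 1)
        # Affine transform of the speaker embedding; its output is
        # subtracted from the main network's activations (adaptation).
        self.affine = nn.Linear(embed_dim, hidden_dim)

    def forward(self, feats, main_hidden):
        # feats:       (batch, time, feat_dim) acoustic features
        # main_hidden: (batch, time, hidden_dim) main ASR activations
        frame_vecs = self.aux(feats)                                  # (B, T, E)
        weights = torch.softmax(self.attn_score(frame_vecs), dim=1)  # (B, T, 1)
        # Attention-weighted pooling -> one embedding per segment.
        spk_embed = (weights * frame_vecs).sum(dim=1)                # (B, E)
        # Subtract the affine-transformed embedding from every frame.
        adapted = main_hidden - self.affine(spk_embed).unsqueeze(1)
        return adapted, spk_embed
```

Under this reading, speaker change detection would compare the embeddings computed on adjacent segments (e.g., via a distance threshold), so a single learned representation serves both tasks; the sketch only shows the adaptation path.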
Citation: Sari, L., Hasegawa-Johnson, M., & Thomas, S. (2021). Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 324–333. https://doi.org/10.1109/TASLP.2020.3040626