Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection

Abstract

Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is run first to obtain single-speaker regions, and speaker adaptation is then performed on the derived speaker segments to improve ASR. In an online setting, however, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention in an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment-dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances that contain a change point and show that the proposed method significantly outperforms the unadapted main network (10-14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
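The adaptation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the single-vector attention scoring (`w_att`), the tensor shapes, and the choice to let segment dependence enter only through a per-segment embedding are all assumptions made for illustration. The abstract states only that the embedding comes from self-attention in an auxiliary network and that an affine transformation of it is subtracted from the main network activations.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pooled_embedding(aux_feats, w_att):
    """Pool frame-level auxiliary features into one segment embedding.

    aux_feats: (T, D) frame-level outputs of the auxiliary network.
    w_att:     (D,) attention scoring vector (hypothetical parameter).
    Returns a (D,) embedding: the attention-weighted mean of the frames.
    """
    scores = aux_feats @ w_att         # (T,) one score per frame
    weights = softmax(scores, axis=0)  # (T,) non-negative, sums to 1
    return weights @ aux_feats         # (D,)


def adapt_activations(main_acts, embedding, W, b):
    """Subtract an affine transformation of the embedding from activations.

    main_acts: (T, H) hidden activations of the main ASR network.
    embedding: (D,) speaker embedding for the current segment.
    W, b:      (H, D) and (H,) affine parameters (hypothetical names).
    """
    shift = W @ embedding + b          # (H,) same shift for every frame
    return main_acts - shift           # broadcast over the T frames
```

In this sketch the same shift is applied to every frame of a segment, so a new embedding (and hence a new shift) would be computed whenever the segment changes. How the embedding is used to detect change points is not specified in the abstract and is left out here.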

Cite

Sari, L., Hasegawa-Johnson, M., & Thomas, S. (2021). Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection. IEEE/ACM Transactions on Audio Speech and Language Processing, 29, 324–333. https://doi.org/10.1109/TASLP.2020.3040626
