Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection

Leda Sari; Mark Hasegawa-Johnson; Samuel Thomas

Journal Article (Open Access)

IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021), vol. 29, pp. 324-333

DOI: 10.1109/TASLP.2020.3040626

7 Citations

20 Readers

Abstract

Speaker adaptation and speaker change detection have both been studied extensively as ways to improve automatic speech recognition (ASR). In many cases, the two problems are investigated separately: speaker change detection is run first to obtain single-speaker regions, and speaker adaptation is then performed on the resulting speaker segments to improve ASR. In an online setting, however, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed with self-attention by an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment-dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances that contain a change point and show that the proposed method performs significantly better than the unadapted main network (10-14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
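
The adaptation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dot-product attention scoring, the layer sizes, the random values, and the names attention_pool, adapt, query, A and b are all assumptions made for illustration. It pools frame-level auxiliary features into one embedding with attention weights, then subtracts an affine transformation of that embedding from the main network activations of a single segment.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def attention_pool(aux_frames, query):
    """Pool frame-level auxiliary features (T, d) into one embedding (d,).

    Assumption: a single learned query vector scores each frame; the paper's
    exact self-attention formulation may differ.
    """
    scores = aux_frames @ query / np.sqrt(aux_frames.shape[1])
    weights = softmax(scores)          # one weight per frame, sums to 1
    return weights @ aux_frames        # weighted average of the frames


def adapt(main_acts, embedding, A, b):
    """Subtract an affine transform of the embedding from every frame.

    main_acts: (T, h) activations of one main-network layer for one segment.
    A, b: hypothetical affine parameters for that segment.
    """
    shift = A @ embedding + b          # shape (h,)
    return main_acts - shift           # broadcast over the T frames


# Toy dimensions and random values, for illustration only.
rng = np.random.default_rng(0)
T, d, h = 50, 8, 16
aux_frames = rng.normal(size=(T, d))
main_acts = rng.normal(size=(T, h))
query = rng.normal(size=d)
A = rng.normal(size=(h, d))
b = rng.normal(size=h)

embedding = attention_pool(aux_frames, query)
adapted = adapt(main_acts, embedding, A, b)
```

In this sketch the same shift is removed from every frame of the segment, so a different segment (with a different embedding or different A and b) would receive a different correction. How the embedding is used for speaker change detection is not specified in the abstract and is left out here.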

Cite (APA)

Sari, L., Hasegawa-Johnson, M., & Thomas, S. (2021). Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 324–333. https://doi.org/10.1109/TASLP.2020.3040626

Readers' Seniority

PhD / Post grad / Masters / Doc: 7 (88%)
Professor / Associate Prof.: 1 (13%)

Readers' Discipline

Computer Science: 6 (60%)
Engineering: 2 (20%)
Agricultural and Biological Sciences: 1 (10%)
Social Sciences: 1 (10%)