Partitioned Gradient Matching based Data Subset Selection for Compute-Efficient & Robust ASR Training

Ashish Mittal; Durga Sivasubramanian; Rishabh Iyer; Preethi Jyothi; Ganesh Ramakrishnan

Conference Proceedings, Open Access

Findings of the Association for Computational Linguistics: EMNLP 2022 (2022) 6028-6039

DOI: 10.18653/v1/2022.findings-emnlp.443

Abstract

Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of the training data could mitigate this problem if the selected subset could achieve performance on par with training on the entire dataset. Although there are many data subset selection (DSS) algorithms, applying them directly to RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, since RNN-T tends to have gradients with a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves a 3× to 6× speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
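
The abstract names the idea but does not spell out the procedure, so the following is only an illustrative sketch of what a partitioned, gradient-matching style selection could look like, not the authors' implementation. It assumes per-sample gradients are already available as plain lists of floats, splits them into contiguous partitions, and within each partition greedily picks weighted samples whose gradients approximate the partition's summed gradient. The solver here is plain matching pursuit, chosen for brevity as a stand-in; the function names `match_partition` and `partitioned_gradient_matching`, and the parameters `num_partitions` and `fraction`, are hypothetical.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_partition(grads, budget):
    """Greedily pick up to `budget` local indices whose weighted gradients
    approximate the summed gradient of this partition (matching pursuit)."""
    dim = len(grads[0])
    # Target to approximate: the sum of all per-sample gradients in the partition.
    residual = [sum(g[d] for g in grads) for d in range(dim)]
    chosen = {}  # local index -> positive weight
    for _ in range(budget):
        best, best_score = None, 0.0
        for i, g in enumerate(grads):
            norm_sq = dot(g, g)
            if i in chosen or norm_sq == 0.0:
                continue
            # Normalized correlation between the residual and this gradient.
            score = dot(residual, g) / norm_sq ** 0.5
            if score > best_score:
                best, best_score = i, score
        if best is None:  # nothing left that reduces the residual
            break
        g = grads[best]
        weight = dot(residual, g) / dot(g, g)
        chosen[best] = weight
        residual = [r - weight * x for r, x in zip(residual, g)]
    return chosen


def partitioned_gradient_matching(grads, num_partitions, fraction):
    """Run the selection independently per partition and merge the results.
    Returns a dict mapping global sample index -> weight."""
    n = len(grads)
    if n == 0:
        return {}
    size = -(-n // num_partitions)  # ceiling division
    selected = {}
    for start in range(0, n, size):
        block = grads[start:start + size]
        budget = max(1, int(fraction * len(block)))
        for local_idx, weight in match_partition(block, budget).items():
            selected[start + local_idx] = weight
    return selected


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in gradients; a real system would take them from the model.
    toy_grads = [[random.gauss(0, 1) for _ in range(5)] for _ in range(40)]
    subset = partitioned_gradient_matching(toy_grads, num_partitions=4, fraction=0.25)
    print(sorted(subset))
```

Because each partition is processed on its own, only one partition's gradients need to be held in memory at a time, and the per-partition calls could in principle run on separate workers, which is one plausible reading of "distributable" in the abstract.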

Cite (APA)

Mittal, A., Sivasubramanian, D., Iyer, R., Jyothi, P., & Ramakrishnan, G. (2022). Partitioned Gradient Matching based Data Subset Selection for Compute-Efficient & Robust ASR Training. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 6028–6039). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-emnlp.443
