LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Jin Xu; Xu Tan; Yi Ren; Tao Qin; Jian Li; Sheng Zhao; Tie-Yan Liu

Conference Proceedings (Open Access)

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2020) 2802-2812

DOI: 10.1145/3394486.3403331

Abstract

Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR) are important speech tasks that require a large amount of paired text and speech data for model training. However, there are more than 6,000 languages in the world, and most of them lack speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and ASR system for the extremely low-resource setting, which can support rare languages at low data cost. LRSpeech consists of three key techniques: 1) pre-training on rich-resource languages and fine-tuning on low-resource languages; 2) dual transformation between TTS and ASR to iteratively boost the accuracy of each other; 3) knowledge distillation to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for TTS in terms of both intelligibility (an intelligibility rate above 98%) and naturalness (a mean opinion score (MOS) above 3.5) of the synthesized speech, satisfying the requirements for industrial deployment; 2) achieves promising recognition accuracy for ASR; and 3) uses extremely low-resource training data. We also conduct comprehensive analyses of LRSpeech with different amounts of data resources, and provide valuable insights and guidance for industrial deployment. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS for more rare languages.
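
The two TTS quality thresholds quoted in the abstract reduce to simple arithmetic. The sketch below is illustrative only. It assumes that the intelligibility rate is the fraction of test words that listeners judge intelligible and that MOS is the arithmetic mean of listener ratings on a 1-to-5 scale. The function names and all sample numbers are hypothetical and are not results from the paper.

```python
from statistics import mean

# Thresholds quoted in the abstract.
IR_THRESHOLD = 0.98   # intelligibility rate must exceed 98%
MOS_THRESHOLD = 3.5   # mean opinion score must exceed 3.5


def intelligibility_rate(intelligible_words: int, total_words: int) -> float:
    """Fraction of test words judged intelligible (assumed definition)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= intelligible_words <= total_words:
        raise ValueError("intelligible_words must lie in [0, total_words]")
    return intelligible_words / total_words


def mean_opinion_score(ratings: list[int]) -> float:
    """Arithmetic mean of listener ratings on a 1-to-5 scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each rating must be between 1 and 5")
    return mean(ratings)


def meets_thresholds(ir: float, mos: float) -> bool:
    """True only if both values are strictly above the abstract's thresholds."""
    return ir > IR_THRESHOLD and mos > MOS_THRESHOLD


# Hypothetical listening-test data, for illustration only.
ir = intelligibility_rate(intelligible_words=493, total_words=500)
mos = mean_opinion_score([4, 3, 4, 4, 3, 4])
print(f"IR = {ir:.1%}, MOS = {mos:.2f}, passes = {meets_thresholds(ir, mos)}")
```

Both comparisons are strict, matching the abstract's wording ("more than" and "above"), so a system scoring exactly 98% would not pass this check.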

Cite (APA)

Xu, J., Tan, X., Ren, Y., Qin, T., Li, J., Zhao, S., & Liu, T.-Y. (2020). LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2802–2812). Association for Computing Machinery. https://doi.org/10.1145/3394486.3403331

