This paper describes CMU’s submission to the IWSLT 2023 simultaneous speech translation shared task for translating English speech to both German text and speech in a streaming fashion. We first build offline speech-to-text (ST) models using the joint CTC/attention framework. These models also use WavLM front-end features and mBART decoder initialization. We adapt our offline ST models for simultaneous speech-to-text translation (SST) by 1) incrementally encoding chunks of input speech, re-computing encoder states for each new chunk and 2) incrementally decoding output text, pruning beam search hypotheses to 1-best after processing each chunk. We then build text-to-speech (TTS) models using the VITS framework and achieve simultaneous speech-to-speech translation (SS2ST) by cascading our SST and TTS models.
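The two-step adaptation described above (re-encoding the growing speech prefix for each new chunk, then pruning the beam to the single best hypothesis after each chunk) can be sketched as follows. This is a minimal toy illustration, not the paper's actual ESPnet implementation: the `encode` and `extend_beam` functions and all scores here are hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    tokens: list = field(default_factory=list)
    score: float = 0.0

def encode(speech_prefix):
    # Stand-in for re-running the full encoder over all chunks received
    # so far (step 1 in the abstract: encoder states are re-computed).
    return sum(speech_prefix)  # toy "encoder state"

def extend_beam(hyps, enc_state, beam_size=3):
    # Toy scorer: each hypothesis branches on two candidate tokens whose
    # scores depend on the (toy) encoder state; keep the top `beam_size`.
    new_hyps = []
    for h in hyps:
        for tok, bonus in (("a", 0.10), ("b", 0.05)):
            new_hyps.append(Hypothesis(h.tokens + [tok], h.score + bonus * enc_state))
    return sorted(new_hyps, key=lambda h: h.score, reverse=True)[:beam_size]

def simultaneous_decode(chunks, beam_size=3):
    committed = []           # tokens already emitted to the user
    hyps = [Hypothesis()]
    prefix = []
    for chunk in chunks:
        prefix.append(chunk)           # grow the speech prefix ...
        enc_state = encode(prefix)     # ... and re-compute encoder states
        hyps = extend_beam(hyps, enc_state, beam_size)
        best = hyps[0]                 # step 2: prune beam to 1-best
        committed = best.tokens        # commit the 1-best prefix
        hyps = [best]
    return committed

print(simultaneous_decode([1.0, 2.0, 3.0]))  # → ['a', 'a', 'a']
```

Pruning to 1-best after each chunk is what makes the emitted output stable: once a token prefix is committed, later chunks can only extend it, never retract it.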
Citation:
Yan, B., Shi, J., Maiti, S., Chen, W., Li, X., Peng, Y., … Watanabe, S. (2023). CMU’s IWSLT 2023 Simultaneous Speech Translation System. In 20th International Conference on Spoken Language Translation, IWSLT 2023 - Proceedings of the Conference (pp. 235–240). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.iwslt-1.20