UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Abstract

Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches for achieving fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and subsequently predicts discrete acoustic units. We enhance model performance through subword prediction in the first-pass decoder, an advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83× decoding speed-up. We show that the proposed methods boost performance even when predicting spectrograms in the second pass; however, predicting discrete units achieves a 2.51× decoding speed-up compared to that case.
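The two-pass flow described above (speech encoder → first-pass text decoder → second-pass unit decoder) can be sketched as follows. This is a minimal, hypothetical illustration of the dataflow only, not the authors' implementation: all function names, shapes, and the toy numeric logic are assumptions standing in for real Conformer/Transformer components.

```python
# Hypothetical sketch of UnitY-style two-pass inference.
# Real components would be neural networks; here each stage is a
# toy function so the dataflow between passes is visible.

def encode_speech(features):
    # Stand-in for the speech encoder: pool each frame's features
    # into a single scalar state.
    return [sum(frame) / len(frame) for frame in features]

def first_pass_text_decoder(enc_states):
    # First pass: autoregressively emit subword token ids conditioned
    # on the encoder output (dummy rule: threshold each state).
    return [1 if s > 0 else 0 for s in enc_states]

def second_pass_unit_decoder(enc_states, text_states):
    # Second pass: attend over both the encoder output and the
    # first-pass decoder states to emit discrete acoustic units
    # (here: a toy combination of the two inputs).
    return [round(10 * e) + t for e, t in zip(enc_states, text_states)]

def unity_inference(features):
    enc = encode_speech(features)
    subwords = first_pass_text_decoder(enc)            # pass 1: text
    units = second_pass_unit_decoder(enc, subwords)    # pass 2: units
    return subwords, units

# A (hypothetical) vocoder would then synthesize a waveform from `units`.
subwords, units = unity_inference([[0.2, 0.4], [-0.1, -0.3]])
```

The key design point the paper exploits is that the second-pass decoder is conditioned on the first-pass textual states, so the text prediction guides unit generation while the whole model remains end-to-end trainable.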

Citation (APA)
Inaguma, H., Popuri, S., Kulikov, I., Chen, P. J., Wang, C., Chung, Y. A., … Pino, J. (2023). UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 15655–15680). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.872
