RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer

Abstract

End-to-end simultaneous speech translation (SST), which directly translates speech in one language into text in another language in real time, is useful in many scenarios but has not been fully investigated. In this work, we propose RealTranS, an end-to-end model for SST. To bridge the modality gap between speech and text, RealTranS gradually downsamples the input speech with interleaved convolution and unidirectional Transformer layers for acoustic modeling, and then maps speech features into text space with a weighted-shrinking operation and a semantic encoder. In addition, to improve model performance in simultaneous scenarios, we propose a blank penalty to enhance the shrinking quality and a Wait-K-Stride-N strategy to allow local reranking during decoding. Experiments on widely used public datasets show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models as well as cascaded models in diverse latency settings.
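
The abstract names the Wait-K-Stride-N strategy without spelling out its read/write schedule. The sketch below shows one plausible reading, stated here as an assumption rather than the authors' definition: the decoder first reads K source segments, then alternates between writing N target tokens and reading N more segments, and flushes the remaining tokens once the source is exhausted. Writing N tokens per step is what would leave room for reranking within each local group. The function name `wait_k_stride_n` and the action labels are hypothetical and chosen only for illustration.

```python
def wait_k_stride_n(num_source, num_target, k, n):
    """Return a READ/WRITE schedule as a list of (action, count) pairs.

    Assumed policy (not taken from the abstract): read k source segments,
    then alternate writing n tokens and reading n segments.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")

    read = min(k, num_source)          # initial wait of k segments
    actions = [("READ", read)]
    written = 0

    while written < num_target:
        if read >= num_source:
            # Source fully consumed: emit all remaining target tokens.
            step = num_target - written
        else:
            # Emit a group of up to n tokens before reading again.
            step = min(n, num_target - written)
        actions.append(("WRITE", step))
        written += step

        if written < num_target and read < num_source:
            step = min(n, num_source - read)
            actions.append(("READ", step))
            read += step

    return actions


if __name__ == "__main__":
    for action, count in wait_k_stride_n(num_source=6, num_target=5, k=3, n=2):
        print(action, count)
```

Under this reading, setting N to 1 gives a schedule that reads K segments and then alternates single writes and single reads, so larger N trades a coarser schedule for groups of tokens that can be scored together.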

Cite

Zeng, X., Li, L., & Liu, Q. (2021). RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 2461–2474). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.218
