Abstract
Automatic Speech Recognition (ASR) plays a crucial role in voice-based applications. For applications requiring real-time feedback, such as Voice Search, streaming capability becomes vital. While LSTM/RNN and CTC based ASR systems are commonly employed for low-latency streaming applications, they often exhibit lower accuracy than state-of-the-art models because they lack access to future audio frames. In this work, we focus on developing accurate LSTM, attention, and CTC based streaming ASR models for large-scale Hinglish (a blend of Hindi and English) Voice Search. We investigate various modifications to vanilla LSTM training that enhance the system's accuracy while preserving its streaming capabilities. We also address the critical requirement of end-of-speech (EOS) detection in streaming applications. We present a simple training and inference strategy for end-to-end CTC models that enables joint ASR and EOS detection. The evaluation of our model on Flipkart's Voice Search, which handles substantial traffic of approximately 6 million queries per day, demonstrates significant performance gains over the vanilla LSTM-CTC model. Our model achieves a word error rate (WER) of 3.69% without EOS and 4.78% with EOS, while also reducing the search latency by approximately 1300 ms (equivalent to a 46.64% reduction) when compared to an independent voice activity detection (VAD) model.
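For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the authors' evaluation code. The function name `word_error_rate`, the whitespace tokenization, and the example queries are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are tokenized by whitespace; real ASR
    evaluation usually applies text normalization first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical voice-search query with one substituted word out of four.
print(word_error_rate("red shoes for men", "red shoe for men"))
```

A corpus-level WER, such as the figures reported in the abstract, is normally computed by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance values.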
Cite
Goyal, A., & Garera, N. (2023). Building Accurate Low Latency ASR for Streaming Voice Search. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 5, pp. 276–283). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-industry.26