A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI

Abstract

Automatic Speech Recognition (ASR) systems are increasingly powerful and accurate, and also more numerous, with several now available as a service (e.g., Google, IBM, and Microsoft). The most stringent standards for such systems are currently set by their use in Conversational AI technology. These systems are expected to operate incrementally in real time and to be responsive, stable, and robust to the pervasive yet peculiar characteristics of conversational speech, such as disfluencies and overlaps. In this paper, we evaluate the most popular of these systems with metrics and experiments designed with these standards in mind. We also evaluate the speaker diarization (SD) capabilities of the same systems, which will be particularly important for dialogue systems designed to handle multi-party interaction. We found that Microsoft has the leading incremental ASR system which preserves disfluent materials, and that IBM has the leading incremental SD system as well as the ASR system that is most robust to speech overlaps. Google strikes a balance between the two, but none of these systems is yet suitable for reliably handling natural spontaneous conversations in real time.

Cite

Addlesee, A., Yu, Y., & Eshghi, A. (2020). A Comprehensive Evaluation of Incremental Speech Recognition and Diarization for Conversational AI. In COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference (pp. 3492–3503). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.coling-main.312
