Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems

Abstract

A Natural Language Understanding (NLU) component can be used in a dialogue system to perform intent classification, returning an N-best list of hypotheses with corresponding confidence estimates. We perform an in-depth evaluation of five NLUs, focusing on confidence estimation. We measure and visualize calibration for the 10 best hypotheses at the model level and the rank level, and also measure classification performance. The results indicate a trade-off between calibration and performance. In particular, Rasa (with the Sklearn classifier) had the best calibration but the lowest performance scores, while Watson Assistant had the best performance but poor calibration.
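
To make the distinction between model-level and rank-level calibration concrete, here is a minimal Python sketch. It assumes expected calibration error (ECE) with equal-width confidence bins as the metric; the abstract does not say which calibration measure the authors use, so this choice, along with the function names and the data layout (each N-best list as rank-ordered pairs of confidence and correctness), is an illustrative assumption rather than the paper's method.

```python
from typing import List, Tuple

# Assumed layout: one N-best list is a sequence of (confidence, is_correct)
# pairs ordered by rank, best hypothesis first. Hypothetical, not from the paper.
NBest = List[Tuple[float, bool]]


def ece(pairs: List[Tuple[float, bool]], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width confidence bins."""
    if not pairs:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in pairs:
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(pairs)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        error += (len(bucket) / total) * abs(avg_conf - accuracy)
    return error


def model_level_ece(nbest_lists: List[NBest], n_bins: int = 10) -> float:
    """Pool hypotheses from every rank into one calibration estimate."""
    pooled = [pair for nbest in nbest_lists for pair in nbest]
    return ece(pooled, n_bins)


def rank_level_ece(
    nbest_lists: List[NBest], n_best: int = 10, n_bins: int = 10
) -> List[float]:
    """Compute one calibration estimate per rank position."""
    per_rank = []
    for rank in range(n_best):
        at_rank = [nbest[rank] for nbest in nbest_lists if len(nbest) > rank]
        per_rank.append(ece(at_rank, n_bins))
    return per_rank
```

Under these assumptions, a model-level score near zero means the confidences across all ranks match observed accuracy on average, while the rank-level list shows whether miscalibration is concentrated in, say, the top hypothesis or the lower-ranked ones.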

Cite

Khojah, R., Berman, A., & Larsson, S. (2022). Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems. In SIGDIAL 2022 - 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, Proceedings of the Conference (pp. 582–594). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.sigdial-1.54
