Abstract
A Natural Language Understanding (NLU) component can be used in a dialogue system to perform intent classification, returning an N-best list of hypotheses with corresponding confidence estimates. We perform an in-depth evaluation of five NLUs, focusing on confidence estimation. We measure and visualize calibration for the 10 best hypotheses at the model level and the rank level, and also measure classification performance. The results indicate a trade-off between calibration and performance. In particular, Rasa (with the Sklearn classifier) had the best calibration but the lowest performance scores, while Watson Assistant had the best performance but poor calibration.
Cite
Khojah, R., Berman, A., & Larsson, S. (2022). Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems. In SIGDIAL 2022 - 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, Proceedings of the Conference (pp. 582–594). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.sigdial-1.54