Open-world classification in dialog systems requires models to detect open intents while maintaining the quality of in-domain (ID) intent classification. In this work, we revisit methods that leverage distance-based statistics for unsupervised out-of-domain (OOD) detection. We show that despite their superior performance on threshold-independent metrics such as test-set AUROC, threshold values chosen based on validation-set performance do not generalize well to the test set, resulting in substantially lower ID and OOD detection accuracy and F1 scores. Our analysis shows that this lack of generalizability can be successfully mitigated by setting aside a holdout set from the validation data for threshold selection (sometimes achieving relative gains as high as 100%). Extensive experiments on seven benchmark datasets show that this fix puts the performance of these methods on par with, and sometimes above, current state-of-the-art OOD detection techniques.
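To make the recipe concrete, below is a minimal sketch, not the authors' released code, of one common distance-based detector in this line of work: Mahalanobis distance to the nearest ID class mean, with the decision threshold tuned only on a holdout slice of the validation data. The function names, the F1-maximizing threshold search, and the synthetic data are illustrative assumptions; the paper's exact scoring function and selection criterion may differ.

```python
import numpy as np

def fit_mahalanobis(train_feats, train_labels):
    """Estimate per-class means and a shared (tied) precision matrix from ID features."""
    classes = np.unique(train_labels)
    means = np.stack([train_feats[train_labels == c].mean(axis=0) for c in classes])
    centered = np.concatenate(
        [train_feats[train_labels == c] - means[i] for i, c in enumerate(classes)]
    )
    cov = centered.T @ centered / len(centered)
    return means, np.linalg.pinv(cov)  # pseudo-inverse for numerical safety

def ood_score(feats, means, precision):
    """OOD score = squared Mahalanobis distance to the nearest class mean (higher = more OOD)."""
    diffs = feats[:, None, :] - means[None, :, :]               # (n, k, d)
    dists = np.einsum("nkd,de,nke->nk", diffs, precision, diffs)
    return dists.min(axis=1)

def pick_threshold(holdout_scores, holdout_is_ood):
    """Choose the score threshold that maximizes OOD-detection F1 on the holdout split."""
    best_t, best_f1 = None, -1.0
    for t in np.quantile(holdout_scores, np.linspace(0.01, 0.99, 99)):
        pred = holdout_scores > t
        tp = np.sum(pred & holdout_is_ood)
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(holdout_is_ood.sum(), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-12)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

if __name__ == "__main__":
    # Synthetic demo (illustrative only): 4 ID intents in a 16-d feature space,
    # standing in for encoder features such as a fine-tuned classifier's embeddings.
    rng = np.random.default_rng(0)
    train_labels = rng.integers(0, 4, 400)
    class_centers = rng.normal(scale=3.0, size=(4, 16))
    train_feats = class_centers[train_labels] + rng.normal(size=(400, 16))

    # Validation mixes ID examples with shifted (OOD) examples.
    val_id = class_centers[rng.integers(0, 4, 100)] + rng.normal(size=(100, 16))
    val_ood = rng.normal(loc=8.0, size=(100, 16))
    val_feats = np.concatenate([val_id, val_ood])
    val_is_ood = np.array([False] * 100 + [True] * 100)

    means, precision = fit_mahalanobis(train_feats, train_labels)

    # The fix under study: reserve a holdout slice of validation data
    # purely for threshold selection, instead of tuning on all of it.
    perm = rng.permutation(len(val_feats))
    holdout, rest = perm[:100], perm[100:]
    scores = ood_score(val_feats, means, precision)
    threshold = pick_threshold(scores[holdout], val_is_ood[holdout])
    print(f"threshold={threshold:.2f}, "
          f"flagged OOD on remaining val: {(scores[rest] > threshold).mean():.2f}")
```

The key point the sketch illustrates is that the threshold is a free parameter the AUROC never exercises, so it needs its own held-out data to be set reliably.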
Khosla, S., & Gangadharaiah, R. (2022). Evaluating the Practical Utility of Confidence-score based Techniques for Unsupervised Open-world Intent Classification. In Insights 2022 - 3rd Workshop on Insights from Negative Results in NLP, Proceedings of the Workshop (pp. 18–23). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.insights-1.3