In the pursuit of a deeper understanding of a model's behaviour, there is recent impetus for developing suites of probes aimed at diagnosing models beyond simple metrics like accuracy or BLEU. This paper takes a step back and asks an important and timely question: how reliable are these diagnostics in providing insight into models and training setups? We critically examine three recent diagnostic tests for pre-trained language models, and find that likelihood-based and representation-based model diagnostics are not yet as reliable as previously assumed. Based on our empirical findings, we also formulate recommendations for practitioners and researchers.
Aribandi, V., Tay, Y., & Metzler, D. (2021). How Reliable are Model Diagnostics? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 1778–1785). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-acl.155