Hard and Soft Evaluation of NLP models with BOOtSTrap SAmpling - BooStSa

Tommaso Fornaciari; Alexandra Uma; Massimo Poesio; Dirk Hovy

Conference Proceedings (Open Access)

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2022) 127-134

DOI: 10.18653/v1/2022.acl-demo.12

Abstract

Because Natural Language Processing (NLP) is an applied field, it is necessary to select the most effective and robust models. However, slightly higher performance alone is insufficient; we want to know whether this advantage will carry over to other data sets. Bootstrapped significance tests can indicate that ability. Computing the significance of performance differences can be complex and tedious, though, especially when the experimental design has many conditions to compare and several runs of experiments. We present BooStSa, a tool that makes it easy to compute significance levels with the BOOtSTrap SAmpling procedure. BooStSa can evaluate models that predict not only standard hard labels but also soft labels (i.e., probability distributions over different classes).
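To make the idea of a bootstrapped significance test concrete, the following is a minimal sketch of one common formulation of the paired bootstrap test. It is not BooStSa's code or API: the function names (`accuracy`, `neg_cross_entropy`, `paired_bootstrap_p_value`), the parameters, and the "sample difference exceeds twice the observed difference" criterion are all assumptions chosen for illustration. The metric is pluggable, so the same routine can score hard labels (accuracy) or soft labels (here, negative cross-entropy, so that higher is better).

```python
import math
import random


def accuracy(gold, pred):
    """Hard-label metric: fraction of exact matches (higher is better)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def neg_cross_entropy(gold_dists, pred_dists, eps=1e-12):
    """Soft-label metric: negated mean cross-entropy (higher is better).

    Each item is a probability distribution over classes; eps avoids log(0).
    """
    total = 0.0
    for g, p in zip(gold_dists, pred_dists):
        total += sum(gk * math.log(pk + eps) for gk, pk in zip(g, p))
    return total / len(gold_dists)


def paired_bootstrap_p_value(gold, pred_base, pred_exp,
                             metric=accuracy, n_resamples=1000, seed=0):
    """Illustrative paired bootstrap test (an assumed formulation).

    Returns the observed metric difference (experimental minus baseline)
    and the fraction of resamples whose difference exceeds twice the
    observed one, which is used here as the p-value estimate.
    """
    rng = random.Random(seed)
    n = len(gold)
    observed = metric(gold, pred_exp) - metric(gold, pred_base)
    exceed = 0
    for _ in range(n_resamples):
        # Resample item indices with replacement; both systems are
        # scored on the same resampled items (hence "paired").
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        delta = (metric(g, [pred_exp[i] for i in idx])
                 - metric(g, [pred_base[i] for i in idx]))
        if delta > 2 * observed:
            exceed += 1
    return observed, exceed / n_resamples
```

Under these assumptions, a call such as `paired_bootstrap_p_value(gold, baseline_preds, new_preds)` returns the observed improvement and an estimated p-value; passing `metric=neg_cross_entropy` with lists of probability distributions applies the same procedure to soft labels.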

Cite

Fornaciari, T., Uma, A., Poesio, M., & Hovy, D. (2022). Hard and Soft Evaluation of NLP models with BOOtSTrap SAmpling - BooStSa. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 127–134). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-demo.12
