In this paper, we discuss paradigms for evaluating open-domain semantic interpretation as they apply to the PASCAL Recognizing Textual Entailment (RTE) evaluation (Dagan et al. 2005). We focus on three aspects critical to a successful evaluation: creation of large quantities of reasonably good training data, analysis of inter-annotator agreement, and joint analysis of test item difficulty and test-taker proficiency (Rasch analysis). We found that although RTE does not correspond to a "real" or naturally occurring language processing task, it nonetheless provides clear and simple metrics, a tolerable cost of corpus development, good annotator reliability (with the potential to exploit the remaining variability), and the possibility of finding noisy but plentiful training material.
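As an illustration of the two quantitative tools the abstract names, the following is a minimal, self-contained Python sketch (not from the paper): Cohen's kappa, one common measure of inter-annotator agreement, and a Rasch model fit by simple joint gradient ascent. The function names, the fitting procedure, and the toy data are illustrative assumptions of this note; a real analysis would use an established statistics or item-response-theory package.

    import math

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa: observed agreement corrected for chance agreement."""
        n = len(labels_a)
        p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        cats = set(labels_a) | set(labels_b)
        p_exp = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                    for c in cats)
        return (p_obs - p_exp) / (1 - p_exp)

    def rasch_prob(theta, b):
        """Rasch model: P(correct) is the logistic of (proficiency - difficulty)."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def fit_rasch(responses, iters=500, lr=0.05):
        """Joint maximum-likelihood fit by gradient ascent.
        responses[p][i] is 1 if test taker p got item i right, else 0."""
        n_persons, n_items = len(responses), len(responses[0])
        thetas = [0.0] * n_persons   # test-taker proficiencies
        bs = [0.0] * n_items         # item difficulties
        for _ in range(iters):
            for p in range(n_persons):
                grad = sum(responses[p][i] - rasch_prob(thetas[p], bs[i])
                           for i in range(n_items))
                thetas[p] += lr * grad
            for i in range(n_items):
                grad = sum(rasch_prob(thetas[p], bs[i]) - responses[p][i]
                           for p in range(n_persons))
                bs[i] += lr * grad
            # Anchor the scale: center item difficulties at zero.
            m = sum(bs) / n_items
            bs = [b - m for b in bs]
            thetas = [t - m for t in thetas]
        return thetas, bs

    # Toy data only: two annotators' yes/no entailment judgments ...
    ann_a = ["yes", "yes", "no", "yes", "no", "no"]
    ann_b = ["yes", "no", "no", "yes", "no", "yes"]
    print("kappa:", round(cohen_kappa(ann_a, ann_b), 2))

    # ... and four systems' right/wrong answers on five test items.
    answers = [[1, 1, 1, 0, 0],
               [1, 1, 0, 1, 0],
               [1, 0, 1, 0, 0],
               [0, 1, 0, 0, 1]]
    thetas, bs = fit_rasch(answers)
    print("proficiencies:", [round(t, 2) for t in thetas])
    print("difficulties: ", [round(b, 2) for b in bs])

In the setting the abstract describes, the Rasch fit separates how hard each test item is from how proficient each test taker (here, each system) is, placing both on a single logistic scale.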
CITATION STYLE
Bayer, S., Burger, J., Ferro, L., Henderson, J., Hirschman, L., & Yeh, A. (2006). Evaluating semantic evaluations: How RTE measures up. In Lecture Notes in Computer Science (Vol. 3944, pp. 309–331). Springer-Verlag. https://doi.org/10.1007/11736790_18