In this paper, we discuss paradigms for evaluating open-domain semantic interpretation as they apply to the PASCAL Recognizing Textual Entailment (RTE) evaluation (Dagan et al. 2005). We focus on three aspects critical to a successful evaluation: creation of large quantities of reasonably good training data, analysis of inter-annotator agreement, and joint analysis of test item difficulty and test-taker proficiency (Rasch analysis). We found that although RTE does not correspond to a "real" or naturally occurring language processing task, it nonetheless provides clear and simple metrics, a tolerable cost of corpus development, good annotator reliability (with the potential to exploit the remaining variability), and the possibility of finding noisy but plentiful training material.
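As an illustration of the two quantitative tools the abstract names, the following is a minimal, self-contained Python sketch (not from the paper): Cohen's kappa, one common measure of inter-annotator agreement, and a Rasch model fit by simple joint gradient ascent. The function names, the fitting procedure, and the toy data are illustrative assumptions of this note; a real analysis would use an established statistics or item-response-theory package.

    import math

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa: observed agreement corrected for chance agreement."""
        n = len(labels_a)
        p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        cats = set(labels_a) | set(labels_b)
        p_exp = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                    for c in cats)
        return (p_obs - p_exp) / (1 - p_exp)

    def rasch_prob(theta, b):
        """Rasch model: P(correct) is the logistic of (proficiency - difficulty)."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def fit_rasch(responses, iters=500, lr=0.05):
        """Joint maximum-likelihood fit by gradient ascent.
        responses[p][i] is 1 if test taker p got item i right, else 0."""
        n_persons, n_items = len(responses), len(responses[0])
        thetas = [0.0] * n_persons   # test-taker proficiencies
        bs = [0.0] * n_items         # item difficulties
        for _ in range(iters):
            for p in range(n_persons):
                grad = sum(responses[p][i] - rasch_prob(thetas[p], bs[i])
                           for i in range(n_items))
                thetas[p] += lr * grad
            for i in range(n_items):
                grad = sum(rasch_prob(thetas[p], bs[i]) - responses[p][i]
                           for p in range(n_persons))
                bs[i] += lr * grad
            # Anchor the scale: center item difficulties at zero.
            m = sum(bs) / n_items
            bs = [b - m for b in bs]
            thetas = [t - m for t in thetas]
        return thetas, bs

    # Toy data only: two annotators' yes/no entailment judgments ...
    ann_a = ["yes", "yes", "no", "yes", "no", "no"]
    ann_b = ["yes", "no", "no", "yes", "no", "yes"]
    print("kappa:", round(cohen_kappa(ann_a, ann_b), 2))

    # ... and four systems' right/wrong answers on five test items.
    answers = [[1, 1, 1, 0, 0],
               [1, 1, 0, 1, 0],
               [1, 0, 1, 0, 0],
               [0, 1, 0, 0, 1]]
    thetas, bs = fit_rasch(answers)
    print("proficiencies:", [round(t, 2) for t in thetas])
    print("difficulties: ", [round(b, 2) for b in bs])

In the setting the abstract describes, the Rasch fit separates how hard each test item is from how proficient each test taker (here, each system) is, placing both on a single logistic scale.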
CITATION STYLE
Bayer, S., Burger, J., Ferro, L., Henderson, J., Hirschman, L., & Yeh, A. (2006). Evaluating semantic evaluations: How RTE measures up. In Lecture Notes in Computer Science (Vol. 3944, pp. 309–331). Springer-Verlag. https://doi.org/10.1007/11736790_18