Evaluation of natural language generation (NLG) is complex and multi-dimensional. Generated text can be evaluated for fluency, coherence, factuality, or other dimensions of interest. Most frameworks that perform such multi-dimensional evaluation require training on large manually or synthetically generated datasets. In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning, obviating the need for large training datasets. Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization, establishing state-of-the-art results on dimensions such as relevance and factual consistency. We then analyze the effects of factors such as the selection and number of in-context examples on performance. Finally, we study the efficacy of in-context learning-based evaluators in evaluating zero-shot summaries written by large language models such as GPT-3. Our code is available at https://github.com/JainSameer06/ICE.
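To make the in-context learning setup concrete, the sketch below shows one plausible way to assemble a few scored demonstrations and a test (source, summary) pair into a single evaluation prompt for one dimension (e.g., relevance). This is a minimal illustration, not the released ICE code: the prompt template and the `llm_complete` callable are assumptions standing in for whatever completion API is used.

```python
# Minimal sketch of a per-dimension in-context evaluation prompt.
# `llm_complete` is a hypothetical stand-in for any text-completion API;
# the template below is illustrative, not the one used in the paper.
from typing import Callable, List, Tuple

def build_icl_prompt(
    dimension: str,
    examples: List[Tuple[str, str, str]],  # (source, summary, score) demonstrations
    source: str,
    summary: str,
) -> str:
    """Concatenate scored demonstrations with the test instance."""
    header = f"Evaluate the {dimension} of the summary on a scale of 1 to 5.\n\n"
    demos = ""
    for ex_source, ex_summary, ex_score in examples:
        demos += (
            f"Source: {ex_source}\n"
            f"Summary: {ex_summary}\n"
            f"{dimension.capitalize()} score: {ex_score}\n\n"
        )
    query = f"Source: {source}\nSummary: {summary}\n{dimension.capitalize()} score:"
    return header + demos + query

def score_summary(
    llm_complete: Callable[[str], str],  # hypothetical completion function
    dimension: str,
    examples: List[Tuple[str, str, str]],
    source: str,
    summary: str,
) -> str:
    """Query the LLM once per (summary, dimension) pair and return its raw score."""
    prompt = build_icl_prompt(dimension, examples, source, summary)
    return llm_complete(prompt).strip()
```

Scoring each dimension with its own prompt and demonstrations is what allows the evaluator to be multi-dimensional without any task-specific training.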
CITATION
Jain, S., Keshava, V., Sathyendra, S. M., Fernandes, P., Liu, P., Neubig, G., & Zhou, C. (2023). Multi-Dimensional Evaluation of Text Summarization with In-Context Learning. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 8487–8495). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.537