Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA


Abstract

While large language models (LLMs) have demonstrated impressive text-generation capabilities, their use in areas requiring domain-specific expertise, such as law, must be approached cautiously because LLM-generated texts may contain factual errors. Motivated by this issue, we propose Eval-RAG, a new evaluation method for LLM-generated texts. Unlike existing methods, Eval-RAG evaluates the validity of a generated text against related documents collected by a retriever; in other words, it adapts the idea of retrieval-augmented generation (RAG) to the purpose of evaluation. Our experimental results on Korean legal question-answering (QA) tasks show that conventional LLM-based evaluation methods align better with lawyers' evaluations when combined with Eval-RAG. In addition, our qualitative analysis shows that Eval-RAG successfully finds factual errors in LLM-generated texts that existing evaluation methods miss.
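The abstract does not include an implementation, but a minimal sketch of the retrieval-based evaluation idea might look like the following. Here `retrieve_documents` and `call_llm` are hypothetical placeholders for an arbitrary retriever and LLM client, and the prompt wording is an assumption for illustration, not the authors' actual prompt or pipeline.

```python
# Sketch of retrieval-based evaluation (the Eval-RAG idea):
# 1) retrieve documents related to the question,
# 2) ask an evaluator LLM to judge the generated answer against them.
# retrieve_documents and call_llm are placeholders, not the authors' code.

from typing import List


def retrieve_documents(question: str, k: int = 5) -> List[str]:
    """Placeholder retriever: return the top-k documents related to the question."""
    raise NotImplementedError("plug in a retriever, e.g. BM25 or a dense index")


def call_llm(prompt: str) -> str:
    """Placeholder LLM call: return the model's completion for the prompt."""
    raise NotImplementedError("plug in an LLM client")


def eval_rag_score(question: str, generated_answer: str, k: int = 5) -> str:
    """Ask an evaluator LLM to grade an answer using retrieved evidence as context."""
    evidence = "\n\n".join(retrieve_documents(question, k))
    prompt = (
        "You are evaluating a legal answer for factual correctness.\n"
        f"Question:\n{question}\n\n"
        f"Retrieved reference documents:\n{evidence}\n\n"
        f"Answer to evaluate:\n{generated_answer}\n\n"
        "Based only on the reference documents, rate the answer's factual "
        "validity from 1 (incorrect) to 5 (fully supported), and list any "
        "statements that are contradicted by or absent from the documents."
    )
    return call_llm(prompt)
```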

Cite

APA

Ryu, C., Lee, S., Pang, S., Choi, C., Choi, H., Min, M., & Sohn, J. Y. (2023). Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA. In NLLP 2023 - Natural Legal Language Processing Workshop 2023, Proceedings of the Workshop (pp. 132–137). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.nllp-1.13
