PEARL: A Rubric-Driven Multi-Metric Framework for LLM Evaluation

Abstract

Background and objectives: Evaluating Large Language Models (LLMs) presents two interrelated challenges: the general problem of assessing model performance across diverse tasks, and the specific problem of using LLMs themselves as evaluators in pedagogical and educational contexts. Existing approaches often rely on single metrics or opaque preference-based methods, which fail to capture critical dimensions such as explanation quality, robustness, and argumentative diversity, all of which are essential in instructional settings. This paper introduces PEARL, a novel framework conceived, operationalized, and evaluated in the present work using LLM-based scorers, and designed to provide interpretable, reproducible, and pedagogically meaningful assessments across multiple performance dimensions.

Methods: PEARL integrates three specialized rubrics (Technical, Argumentative, and Explanation-focused) covering aspects such as factual accuracy, clarity, completeness, originality, dialecticality, and explanatory usefulness. The framework defines seven complementary metrics: Rubric Win Count (RWC), Global Win Rate (GWR), Rubric Mean Advantage (RMA), Consistency Spread (CS), Win Confidence Score (WCS), Explanation Quality Index (EQI), and Dialectical Presence Rate (DPR). We assessed PEARL on eight open-weight instruction-tuned LLMs across 51 prompts, with outputs scored independently by GPT-4 and LLaMA 3:instruct. This constitutes LLM-based evaluation, and observed alignment with the GPT-4 proxy is mixed across metrics.

Results: Preference-based metrics (RMA, RWC, and GWR) show evidence of group separation; these results are reported with bootstrap confidence intervals and interpreted as exploratory because of the small samples. Robustness-oriented metrics (CS and WCS) and the reasoning-diversity metric (DPR) capture complementary aspects of performance that are not reflected in global win rate. RMA and RWC exhibit statistically significant, FDR-controlled correlations with the GPT-4 proxy, and correlation mapping highlights the complementary and partially orthogonal nature of PEARL’s evaluation dimensions.

Originality: PEARL is the first LLM evaluation framework to combine multi-rubric scoring, explanation-aware metrics, robustness analysis, and multi-LLM-evaluator analysis into a single, extensible system. Its multidimensional design supports both high-level benchmarking and targeted diagnostic assessment, offering a rigorous, transparent, and versatile methodology for researchers, developers, and educators working with LLMs in high-stakes and instructional contexts.
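
The Results mention bootstrap confidence intervals and FDR-controlled correlations without giving the procedures. The following is a minimal sketch of what such computations can look like, not the authors' implementation: it assumes a percentile bootstrap over per-prompt win/loss outcomes and the Benjamini-Hochberg procedure for FDR control, neither of which is confirmed by the abstract. The function names `bootstrap_ci` and `benjamini_hochberg`, the resample count, and the example data are all hypothetical.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `outcomes`.

    Assumption: `outcomes` holds one value per prompt (e.g. 1 = win, 0 = loss),
    so the mean is a win rate. The abstract does not define the metric this way.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the prompts with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical data: 30 wins out of 51 prompts for one model.
    wins = [1] * 30 + [0] * 21
    print(bootstrap_ci(wins))
    # Hypothetical p-values, e.g. one per metric-vs-proxy correlation.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With only 51 prompts the resulting intervals are wide, which is consistent with the abstract's decision to treat the group-separation findings as exploratory.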

Cite

Anghel, C., Anghel, A. A., Pecheanu, E., Craciun, M. V., Cocu, A., & Niculita, C. (2025). PEARL: A Rubric-Driven Multi-Metric Framework for LLM Evaluation. Information (Switzerland), 16(11). https://doi.org/10.3390/info16110926
