Automatic Prompt Engineering for Automatic Scoring


Abstract

Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials. APE was then applied to optimize prompts for each item. We found that, on average, APE increased scoring accuracy by 9%; few-shot learning (i.e., providing multiple labeled examples related to the goal) increased APE performance by 2%; and a high temperature (i.e., the parameter controlling output randomness) was needed in at least part of the APE process to improve scoring accuracy. Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could raise concerns about validity. Thus, the creative variability introduced by LLMs raises considerations about the balance between innovation and adherence to scoring rubrics.
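
The abstract describes the APE workflow only at a high level. As a minimal sketch (not the authors' implementation), the Python below illustrates one way such a search loop could be organized: an LLM is asked, at high temperature, to propose rewrites of a seed scoring prompt; each candidate is then used to score the student responses; and the candidate with the best agreement with human scores is retained. The call_llm helper is hypothetical and stands in for whichever LLM API is used; scikit-learn's cohen_kappa_score with weights="quadratic" computes QWK.

from sklearn.metrics import cohen_kappa_score

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical LLM wrapper; swap in your provider's chat/completions API."""
    raise NotImplementedError

def score_responses(prompt: str, responses: list[str]) -> list[int]:
    # Apply one candidate scoring prompt to every student response and
    # crudely parse the first digit of the model output as the score.
    preds = []
    for r in responses:
        out = call_llm(f"{prompt}\n\nStudent response:\n{r}\n\nScore:", temperature=0.0)
        digits = [c for c in out if c.isdigit()]
        preds.append(int(digits[0]) if digits else 0)
    return preds

def ape_search(seed_prompt: str, responses: list[str], human_scores: list[int],
               n_candidates: int = 8) -> tuple[str, float]:
    # A high temperature here encourages diverse rewrites of the seed instructions.
    candidates = [seed_prompt] + [
        call_llm("Rewrite these scoring instructions so an LLM rater can apply "
                 f"them more accurately:\n\n{seed_prompt}", temperature=1.0)
        for _ in range(n_candidates)
    ]
    best_prompt, best_qwk = seed_prompt, -1.0
    for cand in candidates:
        preds = score_responses(cand, responses)
        qwk = cohen_kappa_score(human_scores, preds, weights="quadratic")
        if qwk > best_qwk:
            best_prompt, best_qwk = cand, qwk
    return best_prompt, best_qwk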

Citation

Xue, M., Liu, Y., Xiao, X., & Wilson, M. (2025). Automatic Prompt Engineering for Automatic Scoring. Journal of Educational Measurement, 62(4), 559–587. https://doi.org/10.1111/jedm.70002
