Automatic Prompt Engineering for Automatic Scoring


Abstract

Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials. APE was then applied to optimize prompts for each item. We found that, on average, APE increased scoring accuracy by 9%; few-shot learning (i.e., providing multiple labeled examples related to the goal) increased APE performance by 2%; and a high temperature (i.e., the parameter controlling output randomness) was needed in at least part of the APE process to improve scoring accuracy. Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could raise concerns about validity. Thus, the creative variability introduced by LLMs raises considerations about the balance between innovation and adherence to scoring rubrics.
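
The abstract describes the APE workflow only at a high level. As a minimal sketch (not the authors' implementation), the Python below illustrates one way such a search loop could be organized: an LLM is asked, at high temperature, to propose rewrites of a seed scoring prompt; each candidate is then used to score the student responses; and the candidate with the best agreement with human scores is retained. The call_llm helper is hypothetical and stands in for whichever LLM API is used; scikit-learn's cohen_kappa_score with weights="quadratic" computes QWK.

from sklearn.metrics import cohen_kappa_score

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical LLM wrapper; swap in your provider's chat/completions API."""
    raise NotImplementedError

def score_responses(prompt: str, responses: list[str]) -> list[int]:
    # Apply one candidate scoring prompt to every student response and
    # crudely parse the first digit of the model output as the score.
    preds = []
    for r in responses:
        out = call_llm(f"{prompt}\n\nStudent response:\n{r}\n\nScore:", temperature=0.0)
        digits = [c for c in out if c.isdigit()]
        preds.append(int(digits[0]) if digits else 0)
    return preds

def ape_search(seed_prompt: str, responses: list[str], human_scores: list[int],
               n_candidates: int = 8) -> tuple[str, float]:
    # A high temperature here encourages diverse rewrites of the seed instructions.
    candidates = [seed_prompt] + [
        call_llm("Rewrite these scoring instructions so an LLM rater can apply "
                 f"them more accurately:\n\n{seed_prompt}", temperature=1.0)
        for _ in range(n_candidates)
    ]
    best_prompt, best_qwk = seed_prompt, -1.0
    for cand in candidates:
        preds = score_responses(cand, responses)
        qwk = cohen_kappa_score(human_scores, preds, weights="quadratic")
        if qwk > best_qwk:
            best_prompt, best_qwk = cand, qwk
    return best_prompt, best_qwk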

Citation

Xue, M., Liu, Y., Xiao, X., & Wilson, M. (2025). Automatic Prompt Engineering for Automatic Scoring. Journal of Educational Measurement, 62(4), 559–587. https://doi.org/10.1111/jedm.70002
