Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment

Chenxu Wang; L. Shuhan; Nuoxi Lin; Xinyu Zhang; Ying Han; Xiandi Wang; Di Liu; Xiaomei Tan; P. Dan; L. Kang; Guangwu Qian; Rong Yin

Journal Article

Journal of Medical Internet Research (2025) 27

DOI: 10.2196/59435

Abstract

Background: Although interest in applying large language models (LLMs) in medicine is growing, the feasibility of using them as standardized patients in medical assessment has rarely been evaluated. We examined whether ChatGPT, a representative LLM, could serve as a cost-effective alternative to standardized patients in medical education, specifically for history-taking tasks.

Objective: This study aims to explore ChatGPT’s viability and performance as a standardized patient, using prompt engineering to refine its accuracy and usefulness in medical assessments.

Methods: A 2-phase experiment was conducted. The first phase assessed feasibility by simulating conversations about inflammatory bowel disease (IBD) across 3 inquiry quality groups (good, medium, and bad). Responses were categorized based on their relevance and accuracy. Each group consisted of 30 runs, with responses scored to determine whether they were related to the inquiries. In the second phase, we evaluated ChatGPT’s performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT’s responses, and performance was compared between the original and revised prompts. A total of 300 runs were conducted and compared against standard reference scores. Finally, the generalizability of the revised prompt was tested using other scripts for another 60 runs, and the effect of the language used on the chatbot’s performance was explored.

Results: The feasibility test confirmed ChatGPT’s ability to simulate a standardized patient effectively, differentiating among poor, medium, and good medical inquiries with varying degrees of accuracy. Score differences between the poor (74.7, SD 5.44) and medium (82.67, SD 5.30) inquiry groups (P

Citation (APA)

Wang, C., Shuhan, L., Lin, N., Zhang, X., Han, Y., Wang, X., … Yin, R. (2025). Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment. Journal of Medical Internet Research, 27. https://doi.org/10.2196/59435
