Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis


Abstract

Background: Advances in artificial intelligence have enabled large language models to significantly influence radiology education and diagnostic accuracy.

Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, on radiology board exams.

Methods: A comparative analysis of 150 text-only multiple-choice questions from radiology board exams was conducted. Each model's accuracy was assessed, with questions categorized by cognitive level and medical specialty; comparisons used χ2 tests and ANOVA.

Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P
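The abstract compares model accuracies with χ2 tests. As an illustration (not the authors' actual analysis code), the reported GPT-4 versus Claude contrast can be checked with a 2×2 χ2 test of independence computed by hand from the counts given above:

```python
# Illustrative 2x2 chi-square test using only the counts reported in the
# abstract: GPT-4 125/150 correct vs. Claude 93/150 correct.
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (df=1, no continuity correction) for the
    2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: model; columns: correct, incorrect
chi2 = chi_square_2x2(125, 25, 93, 57)
print(f"chi2 = {chi2:.2f}")          # statistic for GPT-4 vs. Claude
print("significant at alpha=0.05:", chi2 > 3.841)  # critical value, df=1
```

With these counts the statistic is well above the df=1 critical value of 3.841, consistent with the abstract's claim that GPT-4 significantly outperformed Claude.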

Citation (APA):

Wei, B. (2025). Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis. JMIR Medical Education, 11. https://doi.org/10.2196/64284
