Abstract
Background: Advances in artificial intelligence have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, on radiology board exams. Methods: A comparative analysis of 150 multiple-choice radiology board exam questions without images was conducted. Models were assessed on their accuracy for text-based questions, which were categorized by cognitive level and medical specialty; differences were analyzed using χ2 tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P
Citation
Wei, B. (2025). Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis. JMIR Medical Education, 11. https://doi.org/10.2196/64284