Abstract
Background: Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown. Methods: Three LLMs (OpenAI’s GPT-4 and GPT-3.5, and Google Bard) were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions. Results: GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p
Cite
Xu, A. Y., Singh, M., Balmaceno-Criss, M., Oh, A., Leigh, D., Daher, M., … Daniels, A. H. (2025). Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination. Journal of Orthopaedic Surgery, 33(1). https://doi.org/10.1177/10225536241268789