Comparitive performance of artificial intelligence-based large language models on the orthopedic in-training examination

Andrew Y. Xu; Manjot Singh; Mariah Balmaceno-Criss; Allison Oh; David Leigh; Mohammad Daher; Daniel Alsoof; Christopher L. McDonald; Bassel G. Diebo; Alan H. Daniels

Journal Article (Open Access)

Journal of Orthopaedic Surgery (2025) 33(1)

DOI: 10.1177/10225536241268789

Abstract

Background: Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown. Methods: Three LLMs, OpenAI’s GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions. Results: GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p

Citation (APA)

Xu, A. Y., Singh, M., Balmaceno-Criss, M., Oh, A., Leigh, D., Daher, M., … Daniels, A. H. (2025). Comparitive performance of artificial intelligence-based large language models on the orthopedic in-training examination. Journal of Orthopaedic Surgery, 33(1). https://doi.org/10.1177/10225536241268789