Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination

This article is free to access.

Abstract

Background: Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown. Methods: Three LLMs, OpenAI’s GPT-4 and GPT-3.5 and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions. Results: GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p

Cite

Xu, A. Y., Singh, M., Balmaceno-Criss, M., Oh, A., Leigh, D., Daher, M., … Daniels, A. H. (2025). Comparitive performance of artificial intelligence-based large language models on the orthopedic in-training examination. Journal of Orthopaedic Surgery, 33(1). https://doi.org/10.1177/10225536241268789
