Assessment of a Large Language Model's Responses to Questions and Cases about Glaucoma and Retina Management

Andy S. Huang; Kyle Hirabayashi; Laura Barna; Deep Parikh; Louis R. Pasquale

Journal Article, Open Access

JAMA Ophthalmology (2024) 142(4) 371-375

DOI: 10.1001/jamaophthalmol.2023.6917

Abstract

Importance: Large language models (LLMs) are revolutionizing medical diagnosis and treatment, offering unprecedented accuracy and ease surpassing conventional search engines. Their integration into medical assistance programs will become pivotal for ophthalmologists as an adjunct for practicing evidence-based medicine. Therefore, the diagnostic and treatment accuracy of LLM-generated responses compared with fellowship-trained ophthalmologists can help assess their accuracy and validate their potential utility in ophthalmic subspecialties.

Objective: To compare the diagnostic accuracy and comprehensiveness of responses from an LLM chatbot with those of fellowship-trained glaucoma and retina specialists on ophthalmological questions and real patient case management.

Design, Setting, and Participants: This comparative cross-sectional study recruited 15 participants aged 31 to 67 years, including 12 attending physicians and 3 senior trainees, from eye clinics affiliated with the Department of Ophthalmology at Icahn School of Medicine at Mount Sinai, New York, New York. Glaucoma and retina questions (10 of each type) were randomly selected from the American Academy of Ophthalmology's commonly asked questions Ask an Ophthalmologist. Deidentified glaucoma and retinal cases (10 of each type) were randomly selected from ophthalmology patients seen at Icahn School of Medicine at Mount Sinai-affiliated clinics. The LLM used was GPT-4 (version dated May 12, 2023). Data were collected from June to August 2023.

Main Outcomes and Measures: Responses were assessed via a Likert scale for medical accuracy and completeness. Statistical analysis involved the Mann-Whitney U test and the Kruskal-Wallis test, followed by pairwise comparison.

Results: The combined question-case mean rank for accuracy was 506.2 for the LLM chatbot and 403.4 for glaucoma specialists (n = 831; Mann-Whitney U = 27976.5; P
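The mean ranks and Mann-Whitney U statistic reported in the abstract come from ranking all Likert scores together and comparing the rank sums of the two groups. The following is a minimal sketch of that calculation, not the authors' analysis code: the abstract does not name the software used, the function names are illustrative, and the scores in the usage example are made up. It computes only the mean ranks and U, not the P value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return the mean rank of each group and the (smaller) U statistic."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    rank_sum_b = sum(ranks[n_a:])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return {
        "mean_rank_a": rank_sum_a / n_a,
        "mean_rank_b": rank_sum_b / n_b,
        "u": min(u_a, u_b),
    }


# Hypothetical Likert accuracy scores (not data from the study).
chatbot_scores = [5, 4, 5, 3]
specialist_scores = [3, 2, 4, 1]
print(mann_whitney_u(chatbot_scores, specialist_scores))
```

A higher mean rank for one group indicates that its responses tended to receive higher Likert scores, which is how the 506.2 versus 403.4 comparison in the Results should be read.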

Citation (APA)

Huang, A. S., Hirabayashi, K., Barna, L., Parikh, D., & Pasquale, L. R. (2024). Assessment of a Large Language Model’s Responses to Questions and Cases about Glaucoma and Retina Management. JAMA Ophthalmology, 142(4), 371–375. https://doi.org/10.1001/jamaophthalmol.2023.6917
