Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study

Theresa Isabelle Wilhelm; Jonas Roos; Robert Kaczmarczyk

Journal Article | Open Access

Journal of Medical Internet Research (2023) 25

DOI: 10.2196/49324

105 Citations

144 Readers

Abstract

Background: As advancements in artificial intelligence (AI) continue, large language models (LLMs) have emerged as promising tools for generating medical information. Their rapid adaptation and potential benefits in health care require rigorous assessment in terms of the quality, accuracy, and safety of the generated information across diverse medical specialties.

Objective: This study aimed to evaluate the performance of 4 prominent LLMs, namely, Claude-instant-v1.0, GPT-3.5-Turbo, Command-xlarge-nightly, and Bloomz, in generating medical content spanning the clinical specialties of ophthalmology, orthopedics, and dermatology.

Methods: Three domain-specific physicians evaluated the AI-generated therapeutic recommendations for a diverse set of 60 diseases. The evaluation criteria involved the mDISCERN score, correctness, and potential harmfulness of the recommendations. ANOVA and pairwise t tests were used to explore discrepancies in content quality and safety across models and specialties. Additionally, using the capabilities of OpenAI’s most advanced model, GPT-4, an automated evaluation of each model’s responses to the diseases was performed using the same criteria and compared to the physicians’ assessments through Pearson correlation analysis.

Results: Claude-instant-v1.0 emerged with the highest mean mDISCERN score (3.35, 95% CI 3.23-3.46). In contrast, Bloomz lagged with the lowest score (1.07, 95% CI 1.03-1.10). Our analysis revealed significant differences among the models in terms of quality (P

Cite

Wilhelm, T. I., Roos, J., & Kaczmarczyk, R. (2023). Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. Journal of Medical Internet Research, 25. https://doi.org/10.2196/49324
