Abstract
Background: As advancements in artificial intelligence (AI) continue, large language models (LLMs) have emerged as promising tools for generating medical information. Their rapid adaptation and potential benefits in health care require rigorous assessment in terms of the quality, accuracy, and safety of the generated information across diverse medical specialties. Objective: This study aimed to evaluate the performance of 4 prominent LLMs, namely, Claude-instant-v1.0, GPT-3.5-Turbo, Command-xlarge-nightly, and Bloomz, in generating medical content spanning the clinical specialties of ophthalmology, orthopedics, and dermatology. Methods: Three domain-specific physicians evaluated the AI-generated therapeutic recommendations for a diverse set of 60 diseases. The evaluation criteria involved the mDISCERN score, correctness, and potential harmfulness of the recommendations. ANOVA and pairwise t tests were used to explore discrepancies in content quality and safety across models and specialties. Additionally, using the capabilities of OpenAI’s most advanced model, GPT-4, an automated evaluation of each model’s responses to the diseases was performed using the same criteria and compared to the physicians’ assessments through Pearson correlation analysis. Results: Claude-instant-v1.0 emerged with the highest mean mDISCERN score (3.35, 95% CI 3.23-3.46). In contrast, Bloomz lagged with the lowest score (1.07, 95% CI 1.03-1.10). Our analysis revealed significant differences among the models in terms of quality (P
Author supplied keywords
Cite
CITATION STYLE
Wilhelm, T. I., Roos, J., & Kaczmarczyk, R. (2023). Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. Journal of Medical Internet Research, 25. https://doi.org/10.2196/49324
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.