Comparative evaluation of large language models in delivering guideline-compliant recommendations for topical NSAID use in musculoskeletal pain: a multidimensional analysis

Abstract

Introduction: Large language models (LLMs) are increasingly used in clinical decision support, but their adherence to evidence-based guidelines, particularly for musculoskeletal pain management, remains understudied. Methods: Four LLMs (DeepSeek-R1, ChatGPT-4o, Gemini, Grok-3) were evaluated on their responses about topical NSAID use for musculoskeletal pain using three approaches: assessment of response quality (accuracy, over-conclusiveness, supplementary information, and incompleteness), standardized readability metrics (Flesch Reading Ease, Flesch-Kincaid Grade Level), and the PEMAT-P tool to quantify actionability. Results: The four LLMs differed significantly in accuracy (ANOVA p = 0.045), with Gemini scoring highest (8.33 ± 0.77) and DeepSeek-R1 lowest (7.72 ± 1.52), and in over-conclusiveness (ANOVA p = 0.025), with Grok-3 scoring lowest (4.56 ± 1.42) and ChatGPT-4o highest (6.72 ± 1.49). ChatGPT-4o provided the most supplementary content (6.94 ± 2.29, p = 0.106) and DeepSeek-R1 had the highest incompleteness (5.00 ± 2.52, p = 0.261), but these between-model differences were not statistically significant. All models exceeded recommended readability thresholds (9th–10th grade level), and none met the actionability standard (actionability scores ≤ 33.5%). Conclusions: LLMs demonstrate potential as clinical aids. Gemini and Grok-3 performed comparatively well overall, yet their readability and actionability remain unsatisfactory. Future development should integrate clinician feedback and real-world validation to ensure safety. Human oversight and targeted AI training are critical for safe implementation. (Table presented.)
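For readers unfamiliar with the two readability metrics named in the Methods, the following is a minimal sketch of the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The abstract does not say which tool or counting rules the authors used, so the function names and the word, sentence, and syllable counts below are hypothetical and for illustration only.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula; approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for one model response (not taken from the study).
    words, sentences, syllables = 100, 5, 150
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

Both formulas depend only on average sentence length and average syllables per word, so in practice the main source of variation between tools is how they count sentences and syllables.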

Citation (APA)

Dong, C., Qiu, X., Deng, J., Xu, L., Dong, X., Chen, S., … Yu, L. (2025). Comparative evaluation of large language models in delivering guideline-compliant recommendations for topical NSAID use in musculoskeletal pain: a multidimensional analysis. Clinical Rheumatology, 44(11), 4703–4710. https://doi.org/10.1007/s10067-025-07640-4
