Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations


Abstract

Introduction: Guidelines are of great importance in the management of complex and chronic conditions such as axial spondyloarthropathy. The aim of this study was to compare the answers given by various large language models to open-ended questions derived from the ASAS–EULAR 2022 recommendations. Materials and Methods: This was a cross-sectional, comparative study. A total of 15 recommendations in the ASAS–EULAR 2022 guideline were converted into open-ended questions in a clinical context, drawing directly on their content. Each question was posed to the ChatGPT-3.5, ChatGPT-4o, and Gemini 2.0 Flash models. The answers were rated for usability and reliability on a seven-point Likert scale, assessed for readability with the Flesch–Kincaid Reading Ease (FKRE) and Flesch–Kincaid Grade Level (FKGL) metrics, and assessed for semantic and surface-level similarity with the Universal Sentence Encoder (USE) and ROUGE-L, respectively. The results of the different large language models were compared statistically, and p < 0.05 was considered statistically significant. Results: Google Gemini 2.0 achieved better FKRE and FKGL scores (p < 0.05). Reliability and usefulness scores were significantly higher for ChatGPT-4o and Gemini 2.0 (p < 0.05). ChatGPT-4o yielded significantly higher semantic similarity scores than ChatGPT-3.5 (p < 0.05). There was a negative correlation between FKRE and FKGL scores and a positive correlation between reliability and usefulness scores (p < 0.05). Conclusions: ChatGPT-4o and Gemini 2.0 gave more reliable, useful, and readable answers to open-ended questions derived from the ASAS–EULAR 2022 guidelines. These models may assist in supporting treatment decision-making in rheumatology in the future.
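For readers unfamiliar with the readability and surface-similarity metrics named above, the following is a minimal Python sketch of how they are commonly computed. It is not the authors' pipeline: the abstract does not say which tools were used. The function names are hypothetical, the readability functions take pre-counted words, sentences, and syllables as inputs (syllable counting itself is tool-dependent), and ROUGE-L is shown as a plain whitespace-tokenized F1 over the longest common subsequence, which is an assumption about the variant.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula; higher means harder to read."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as an F1 score over lowercased, whitespace-split tokens (assumed variant)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Both readability formulas depend on the same two ratios (words per sentence and syllables per word) with opposite signs, so a text that scores higher on reading ease scores lower on grade level. This is consistent with the negative correlation between FKRE and FKGL reported in the Results.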

Cite

Usen, A., & Kuculmez, O. (2025). Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations. Diagnostics, 15(12). https://doi.org/10.3390/diagnostics15121455
