Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations

Ahmet Usen; Ozlem Kuculmez

Journal ArticleOPEN ACCESS

Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations

Diagnostics (2025) 15(12)

DOI: 10.3390/diagnostics15121455

1Citations

8Readers

Abstract

Introduction: Guidelines have great importance in revealing complex and chronic conditions such as axial spondyloarthropathy. The aim of this study is to compare the answers given by various large language models to open-ended questions created from ASAS–EULAR 2022 guidance. Materials and Methods: This was a cross-sectional and comparative study. A total of 15 recommendations in the ASAS–EULAR 2022 guideline were derived directly from their content into open-ended questions in a clinical context. Each question was asked to the ChatGPT-3.5, GPT-4o, and Gemini 2.0 Flash models, and the answers were evaluated with a seven-point Likert system in terms of usability, reliability, Flesch–Kincaid Reading Ease (FKRE) and Flesch–Kincaid Grade Level (FKGL) metrics for readability, Universal Sentence Encoder (USE) and ROUGE-L for semantic and surface-level similarity. The results of different large language models were statistically compared, and p < 0.05 was revealed as statistically significant. Results: Better FKRE and FKGL scores were obtained in the Google Gemini 2.0 program (p < 0.05). Reliability and usefulness scores were significantly higher for ChatGPT-4o and Gemini 2.0 (p < 0.05). ChatGPT-4o yielded significantly higher semantic similarity scores compared to ChatGPT-3.5 (p < 0.05). There was a negative correlation between FKRE and FKGL scores and a positive correlation between reliability and usefulness scores (p < 0.05). Conclusions: It was determined that ChatGPT-4o and Gemini 2.0 programs gave more reliable, useful, and readable answers to open-ended questions derived from the ASAS–EULAR 2022 guidelines. These programs may potentially assist in supporting treatment decision-making in rheumatology in the future.

Author supplied keywords

Cite

CITATION STYLE

APA

Usen, A., & Kuculmez, O. (2025). Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations. Diagnostics, 15(12). https://doi.org/10.3390/diagnostics15121455

Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations

Abstract

Author supplied keywords

Cite

Register to see more suggestions