Comparison of Multiple State-of-the-Art Large Language Models for Patient Education Prior to CT and MRI Examinations

Semil Eminovic; Bogdan Levita; Andrea Dell’Orco; Jonas Alexander Leppig; Jawed Nawabi; Tobias Penzkofer

Journal Article, Open Access

Journal of Personalized Medicine (2025) 15(6)

DOI: 10.3390/jpm15060235

Abstract

Background/Objectives: This study compares the accuracy of responses from state-of-the-art large language models (LLMs) to questions patients ask before CT and MRI examinations. We aim to demonstrate the potential of LLMs for improving workflow efficiency, while also highlighting risks such as misinformation. Methods: A total of 57 CT-related and 64 MRI-related patient questions were submitted to ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini, and Mistral Large 2. Each answer was evaluated by two board-certified radiologists and scored for accuracy, correctness, and likelihood to mislead on a 5-point Likert scale. Statistical analysis compared LLM performance across question categories. Results: ChatGPT-4o achieved the highest average score for CT-related questions and tied with Claude 3.5 Sonnet for MRI-related questions; all models scored higher for MRI than for CT (ChatGPT-4o: CT [4.52 (± 0.46)], MRI [4.79 (± 0.37)]; Google Gemini: CT [4.44 (± 0.58)], MRI [4.68 (± 0.58)]; Claude 3.5 Sonnet: CT [4.40 (± 0.59)], MRI [4.79 (± 0.37)]; Mistral Large 2: CT [4.25 (± 0.54)], MRI [4.74 (± 0.47)]). At least one response per LLM was rated as inaccurate, and Google Gemini most often gave potentially misleading answers (5.26% for CT and 2.34% for MRI). Mistral Large 2 was outperformed by ChatGPT-4o across all CT-related questions (p < 0.001) and by ChatGPT-4o (p = 0.003), Google Gemini (p = 0.022), and Claude 3.5 Sonnet (p = 0.004) for all CT questions on contrast media information. Conclusions: Although all LLMs performed well overall and showed great potential for patient education, each model occasionally gave potentially misleading information, highlighting the risk of clinical application.

Citation (APA)

Eminovic, S., Levita, B., Dell’Orco, A., Leppig, J. A., Nawabi, J., & Penzkofer, T. (2025). Comparison of Multiple State-of-the-Art Large Language Models for Patient Education Prior to CT and MRI Examinations. Journal of Personalized Medicine, 15(6). https://doi.org/10.3390/jpm15060235
