Comparative performance evaluation of large language models in answering esophageal cancer-related questions: a multi-model assessment study


Abstract

Background: Esophageal cancer has high incidence and mortality rates, driving public demand for accurate information; however, the reliability of online medical information is often questionable. This study systematically compared the accuracy, completeness, and comprehensibility of mainstream large language models (LLMs) in answering esophageal cancer-related questions.

Methods: In total, 65 questions covering fundamental knowledge, preoperative preparation, surgical treatment, and postoperative management were selected. Each model (ChatGPT 5, Claude Sonnet 4.0, DeepSeek-R1, Gemini 2.5 Pro, and Grok-4) was queried independently using standardized prompts. Five senior clinical experts, comprising three thoracic surgeons, one radiologist, and one medical oncologist, rated the responses on a five-point Likert scale. A retesting mechanism was applied to low-scoring responses, and intraclass correlation coefficients were used to assess rating consistency. Statistical analyses used the Friedman test and the Wilcoxon signed-rank test with Bonferroni correction.

Results: All models performed well, with average scores exceeding 4.0; however, significant differences emerged: Gemini excelled in accuracy, while ChatGPT led in completeness, particularly in surgical and postoperative contexts. Differences were minor for fundamental knowledge questions but notable in complex areas. Retesting improved overall quality, although some responses showed decreased completeness and relevance.

Conclusion: Large language models have considerable potential for answering questions about esophageal cancer, with significant differences in completeness: ChatGPT is more comprehensive in complex scenarios, while Gemini excels in accuracy. This study offers guidance for selecting artificial intelligence tools in clinical settings, advocating a tiered application strategy tailored to specific scenarios and highlighting the importance of user education on the limitations and applicability of LLMs.
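The statistical workflow described in the Methods (an omnibus Friedman test across the five models, followed by pairwise Wilcoxon signed-rank tests with Bonferroni correction) can be sketched as follows. The ratings below are synthetic placeholders, not the study's expert scores, and the matrix layout (65 questions by 5 models) is an assumption based on the abstract:

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# Synthetic Likert ratings (65 questions x 5 models); placeholder data,
# NOT the study's actual expert scores.
rng = np.random.default_rng(42)
models = ["ChatGPT 5", "Claude Sonnet 4.0", "DeepSeek-R1", "Gemini 2.5 Pro", "Grok-4"]
scores = rng.integers(3, 6, size=(65, len(models))).astype(float)

# Omnibus test: Friedman test on the five related samples (one column per model).
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Post hoc: pairwise Wilcoxon signed-rank tests with Bonferroni correction.
pairs = list(combinations(range(len(models)), 2))
alpha_adj = 0.05 / len(pairs)  # 10 pairwise comparisons -> adjusted alpha = 0.005
for i, j in pairs:
    # zero_method="zsplit" keeps zero differences, common with tied Likert scores.
    w, pw = wilcoxon(scores[:, i], scores[:, j], zero_method="zsplit")
    flag = "significant" if pw < alpha_adj else "n.s."
    print(f"{models[i]} vs {models[j]}: p = {pw:.4f} ({flag})")
```

With ten pairwise comparisons, the Bonferroni-adjusted threshold is 0.05 / 10 = 0.005, which matches the correction named in the abstract.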


He, Z., Zhao, L., Li, G., Wang, J., Cai, S., Tu, P., … Chen, W. (2025). Comparative performance evaluation of large language models in answering esophageal cancer-related questions: a multi-model assessment study. Frontiers in Digital Health, 7. https://doi.org/10.3389/fdgth.2025.1670510
