Clinical feasibility of AI Doctors: Evaluating the replacement potential of large language models in outpatient settings for central nervous system tumors

Yifeng Pan; Shen Tian; Jing Guo; Hongqing Cai; Jinghai Wan; Cheng Fang

Journal ArticleOPEN ACCESS

Clinical feasibility of AI Doctors: Evaluating the replacement potential of large language models in outpatient settings for central nervous system tumors

International Journal of Medical Informatics (2025) 203

DOI: 10.1016/j.ijmedinf.2025.106013

7Citations

42Readers

Abstract

Background and Objectives: The treatment of central nervous system (CNS) tumors is complex and resource-intensive, with higher mortality in underserved regions. Large language models (LLMs) show promise in medical support, but their real-world performance in CNS tumor outpatient care remains unclear. This study aims to assess the diagnostic and treatment capabilities of LLMs in bilingual clinical settings. Methods: This retrospective study evaluated three LLMs (ChatGPT-4o, DeepSeek-R1, and Doubao) in assisting neuro-oncology outpatient decision-making within bilingual (Chinese/English) clinical environments. A total of 338 outpatient cases were included, with each model assigned three clinical tasks: differential diagnosis, main diagnosis, and treatment advice. Model outputs were compared against assessments by experienced neurosurgeons. Statistical analysis employed McNemar tests (P < 0.05). Results: ChatGPT-4o and DeepSeek-R1 achieved over 90 % accuracy in differential diagnosis, showing no significant difference compared to doctors (P > 0.05), while Doubao performed significantly worse (Chinese: P = 0.02, English: P = 0.01). In main diagnosis, both ChatGPT-4o and DeepSeek-R1 showed no significant deviation from doctors performance (P > 0.05), whereas Doubao underperformed (Chinese: P = 0.019, English: P = 0.011). For treatment recommendations, all models showed reduced accuracy (ChatGPT-4o: 80.5 %; DeepSeek-R1: 79 %; Doubao: 71.3 %), significantly lower than doctors (Whether in Chinese or English: P < 0.05). No performance difference was observed between Chinese and English cases. Conclusion: LLMs show strong potential in the preliminary diagnosis and decision support for CNS tumors, and their cross-lingual adaptability underscores their clinical feasibility.

Author supplied keywords

Cite

CITATION STYLE

APA

Pan, Y., Tian, S., Guo, J., Cai, H., Wan, J., & Fang, C. (2025). Clinical feasibility of AI Doctors: Evaluating the replacement potential of large language models in outpatient settings for central nervous system tumors. International Journal of Medical Informatics, 203. https://doi.org/10.1016/j.ijmedinf.2025.106013

Clinical feasibility of AI Doctors: Evaluating the replacement potential of large language models in outpatient settings for central nervous system tumors

Abstract

Author supplied keywords

Cite

Register to see more suggestions