The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis

Jiayu Liu; Xiuting Liang; Dandong Fang; Jiqi Zheng; Chengliang Yin; Hui Xie; Yanteng Li; Xiaochun Sun; Yue Tong; Hebin Che; Ping Hu; Fan Yang; Bingxian Wang; Yuanyuan Chen; Gang Cheng; Jianning Zhang

Journal Article, Open Access

Journal of Medical Internet Research (2024) 26

DOI: 10.2196/54985

Abstract

Background: ChatGPT (OpenAI) has shown great potential in clinical diagnosis and could become an excellent auxiliary tool in clinical practice. This study evaluates ChatGPT's diagnostic capabilities by comparing the performance of two model iterations, GPT-3.5 and GPT-4.0.

Objective: This study aims to evaluate the diagnostic ability of GPT-3.5 and GPT-4.0 for colon cancer, to assess their potential as auxiliary diagnostic tools for surgeons, and to compare the diagnostic accuracy rates of GPT-3.5 and GPT-4.0. We assess the accuracy of primary and secondary diagnoses and analyze the causes of misdiagnoses in GPT-3.5 and GPT-4.0 according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings.

Methods: We retrieved 316 case reports for intestinal cancer from the Chinese Medical Association Publishing House database, of which 286 cases were deemed valid after data cleansing. The cases were translated from Mandarin to English and then input into GPT-3.5 and GPT-4.0 using a simple, direct prompt to elicit primary and secondary diagnoses. Three senior surgeons from the General Surgery Department, specializing in colorectal surgery, assessed the diagnostic information at the Chinese PLA (People's Liberation Army) General Hospital. The accuracy of primary and secondary diagnoses was scored based on predefined criteria, and the diagnostic accuracy of GPT-4.0 was compared with that of GPT-3.5. Additionally, we analyzed and compared the causes of misdiagnoses in both models according to the same 7 categories.

Results: Of the 286 cases, both GPT-4.0 and GPT-3.5 demonstrated high diagnostic accuracy for primary diagnoses, but the accuracy of GPT-4.0 was significantly higher than that of GPT-3.5 (mean 0.972, SD 0.137 vs mean 0.855, SD 0.335; t₂₈₅=5.753; P
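
The t statistic above is reported with 285 degrees of freedom, which matches a paired comparison over the 286 cases (285 = 286 - 1). The abstract does not name the test, so treating it as a paired t-test is an assumption. Under that assumption, the minimal sketch below shows how such a statistic could be computed from per-case accuracy scores. The function name `paired_t`, the score lists, and the 0 / 0.5 / 1 scoring scale are hypothetical illustrations, not data or criteria from the study.

```python
import math
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Return (t, df) for a paired t-test on two equal-length score lists.

    Hypothetical helper: the study's actual analysis code is not described
    in the abstract.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    # Work on per-case differences, since both models saw the same cases.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Made-up per-case scores on an assumed scale:
# 1 = correct, 0.5 = partially correct, 0 = incorrect.
model_a = [1, 1, 1, 0.5, 1, 1, 1, 1]
model_b = [1, 0, 1, 0.5, 1, 0.5, 1, 0]

t, df = paired_t(model_a, model_b)
print(f"t({df}) = {t:.3f}")
```

With 286 paired cases the same function would return 285 degrees of freedom, as in the reported result. The P value would then come from the t distribution with that many degrees of freedom, for example via a statistics package.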

Cite

Liu, J., Liang, X., Fang, D., Zheng, J., Yin, C., Xie, H., … Zhang, J. (2024). The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis. Journal of Medical Internet Research, 26. https://doi.org/10.2196/54985
