Abstract
Background: Large language models (LLMs) such as ChatGPT are increasingly explored for clinical decision support. However, their performance in high-stakes emergency scenarios remains underexamined. This study aimed to evaluate ChatGPT’s diagnostic and therapeutic accuracy compared to a board-certified emergency physician across diverse emergency cases.

Methods: This comparative study was conducted using 15 standardized emergency scenarios sourced from validated academic platforms (Geeky Medics, Life in the Fast Lane, Emergency Medicine Cases). ChatGPT (GPT-4) and a physician independently evaluated each case based on five predefined parameters: diagnosis, investigations, initial treatment, clinical safety, and decision-making complexity. Cases were scored out of 5. Concordance was categorized as high (5/5), moderate (4/5), or low (≤ 3/5). Wilson confidence intervals (95%) were calculated for each concordance category.

Results: ChatGPT achieved high concordance (5/5) in 8 cases (53.3%, 95% CI: 27.6–77.0%), moderate concordance (4/5) in 4 cases (26.7%, 95% CI: 10.3–55.4%), and low concordance (≤ 3/5) in 3 cases (20.0%, 95% CI: 6.0–45.6%). Performance was strongest in structured, protocol-based conditions such as STEMI, DKA, and asthma. Lower performance was observed in complex scenarios like stroke, trauma with shock, and mixed acid-base disturbances.

Conclusion: ChatGPT showed strong alignment with emergency physician decisions in structured scenarios but lacked reliability in complex cases. While AI may enhance decision-making and education, it cannot replace the clinical reasoning of human physicians. Its role is best framed as a supportive tool rather than a substitute.
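The scoring scheme and interval estimate described in the Methods can be illustrated with a short sketch. This is not the author's code: the function names `concordance_category` and `wilson_interval` are hypothetical, and the interval is the standard uncorrected Wilson score interval with z = 1.96. The abstract does not state which Wilson variant or software was used, so the bounds computed here may not match the reported ones exactly.

```python
import math


def concordance_category(score: int) -> str:
    """Map a 0-5 case score to the concordance bands named in the abstract."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if score == 5:
        return "high"
    if score == 4:
        return "moderate"
    return "low"  # 3/5 or below


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Uncorrected Wilson score interval for a binomial proportion.

    Assumption: no continuity correction; z = 1.96 approximates a 95% level.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Case counts per concordance band, as reported in the Results.
counts = {"high": 8, "moderate": 4, "low": 3}
total = sum(counts.values())

for band, k in counts.items():
    lower, upper = wilson_interval(k, total)
    print(f"{band}: {k}/{total} = {k / total:.1%}, interval {lower:.1%} to {upper:.1%}")
```

With only 15 cases, every band's interval is wide, which is why the percentages are best read alongside their bounds rather than as point estimates.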
Cite
Gün, M. (2025). Can AI match emergency physicians in managing common emergency cases? A comparative performance evaluation. BMC Emergency Medicine, 25(1). https://doi.org/10.1186/s12873-025-01303-y