ChatGPT Performance on 120 Interdisciplinary Allergology Questions—Systematic Evaluation With Clinical Error Impact Assessment for Critical Erroneous AI-Guided Chatbot Advice


Abstract

Background: ChatGPT (Chatbot with Generative Pretrained Transformer), despite not being a medical device, may be used by patients for medical inquiries. Its accessibility and convenience, particularly amid long waiting times for allergology appointments, make it an attractive but potentially erroneous source of advice.

Objectives: This study evaluates ChatGPT's performance on allergological questions from clinical practice, offering a systematic approach to rating its errors. An Allergological Error Impact Assessment is proposed to analyze the potential consequences of these errors for patients.

Methods: A total of 120 multidisciplinary allergology questions from dermatology, pediatrics, and pulmonology were prompted to ChatGPT (3.5). Responses were assessed for content, accuracy (ACC), completeness (CO), perceived humanness (PHU), and readability (Flesch Reading Ease). Erroneous responses were categorized on a 3-step severity scale (minor, major, and critical). Critical errors underwent allergological error impact analysis. Statistical evaluation included descriptive analyses and Kruskal-Wallis and Mann-Whitney U tests.

Results: ChatGPT demonstrated good accuracy (mean ACC 4.1/5, standard deviation: 0.78, range: 1-5). CO and PHU were sufficient but lowest for pediatric queries. Readability was at an academic level for most responses. Six critical errors were identified: 1 in dermatology, 2 in pediatrics, and 3 in pulmonology. Notably, a critical pediatric food allergen error carried a potentially life-threatening risk.

Conclusion: ChatGPT's imperfect reliability in allergology highlights the need for expert counseling in specialized fields. Tailoring these tools to allergy use cases could improve the utility of models like ChatGPT for clinical applications, such as answering questions from routine allergological care.

Citation (APA)

Mathes, S., Seurig, S., Bluhme, F., Beyer, K., Heizmann, F., Wagner, M., … Darsow, U. (2025). ChatGPT Performance on 120 Interdisciplinary Allergology Questions—Systematic Evaluation With Clinical Error Impact Assessment for Critical Erroneous AI-Guided Chatbot Advice. Journal of Allergy and Clinical Immunology: In Practice, 13(6), 1350-1357.e4. https://doi.org/10.1016/j.jaip.2025.03.030
