Background: We assessed two versions of the large language model (LLM) ChatGPT, versions 3.5 and 4.0, on their ability to generate appropriate, consistent, and readable recommendations on core critical care topics.
Research Question: How do successive LLM versions compare in generating appropriate, consistent, and readable recommendations on core critical care topics?
Design and Methods: Fifty LLM-generated responses to clinical questions were evaluated by two independent intensivists for appropriateness, consistency, and readability on a 5-point Likert scale.
Results: ChatGPT 4.0 showed a significantly higher median appropriateness score than ChatGPT 3.5 (4.0 vs 3.0, P
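The abstract reports a comparison of median Likert scores with an accompanying P value but does not name the statistical test used. The following is a minimal sketch, assuming a two-sided Mann-Whitney U test (a common nonparametric choice for ordinal Likert data) and entirely hypothetical placeholder ratings; it is not the authors' analysis code.

```python
# Minimal sketch: comparing 5-point Likert appropriateness ratings between
# two model versions. The test choice (Mann-Whitney U) is an assumption;
# the article does not specify which test was used.
from statistics import median
from scipy.stats import mannwhitneyu

# Hypothetical placeholder ratings (1 = worst, 5 = best) for 50 clinical
# questions, one list per model version. Not real study data.
gpt35_scores = [3, 2, 4, 3, 3, 2, 4, 3, 3, 2] * 5
gpt40_scores = [4, 5, 4, 3, 4, 5, 4, 4, 3, 4] * 5

print(f"ChatGPT 3.5 median: {median(gpt35_scores)}")
print(f"ChatGPT 4.0 median: {median(gpt40_scores)}")

# Two-sided Mann-Whitney U test comparing the two score distributions.
stat, p_value = mannwhitneyu(gpt35_scores, gpt40_scores, alternative="two-sided")
print(f"U = {stat}, P = {p_value:.4f}")
```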
Balta, K. Y., Javidan, A. P., Walser, E., Arntfield, R., & Prager, R. (2024). Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations. Journal of Intensive Care Medicine. https://doi.org/10.1177/08850666241267871