Validity of evidence-based recommendations by a large language model for interdisciplinary board decisions in neurooncology: An explorative study and critical evaluation



Abstract

Objectives: This study evaluates the stylistic and structural equivalence of Artificial Intelligence (AI)-generated summaries, particularly those produced by Large Language Models (LLMs) such as ChatGPT, against traditional human-written case summaries for neuro-oncological board decisions. The primary goal is to assess how closely AI-generated summaries of board meeting audio recordings align stylistically with human-authored ones.

Methods: 30 traditional human-written case summaries were compared with 30 AI-generated summaries based on board meeting audio recordings. Two expert raters, blinded to the source of each summary, evaluated the 60 cases in total, using a Likert scale to rate plausibility, linguistic style, adherence to evidence, and reference accuracy.

Results: Both LLM-generated and human-reviewed summaries performed consistently well on all criteria. General plausibility ratings were comparable (LLM: 4.70, Human: 4.73, P = .959), as were linguistic style ratings (LLM: 4.87, Human: 4.97, P = .512) and adherence to evidence (LLM: 4.80, Human: 4.87, P = .541). Reference accuracy was slightly higher for the AI-generated summaries (LLM: 4.97, Human: 4.90, P = .664). The second rater's scores were consistent with these findings, and statistical analysis using Kendall's tau showed no significant differences between the two methods (P > .05).

Conclusion: LLM-generated summaries can effectively emulate the style and structure of human-authored ones, indicating their promise as an additional tool in neuro-oncology. Such models can enhance documentation quality and serve as valuable support in clinical settings. While further research is necessary to explore broader applications, LLMs offer clear potential as a complement to traditional decision-making processes.
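The between-rater consistency check in the abstract relies on Kendall's tau, a rank correlation suited to ordinal data such as Likert-scale ratings. As an illustration only (this is not the authors' analysis code, and the ratings below are hypothetical), the tie-corrected tau-b statistic can be sketched in pure Python:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation with a tie correction,
    appropriate for ordinal data such as Likert-scale ratings."""
    assert len(x) == len(y)
    concordant = discordant = tied_x = tied_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tied_x += 1          # pair tied on the first rating series
        if dy == 0:
            tied_y += 1          # pair tied on the second rating series
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1  # both series order the pair the same way
            else:
                discordant += 1
    n0 = len(x) * (len(x) - 1) // 2  # total number of pairs
    return (concordant - discordant) / sqrt((n0 - tied_x) * (n0 - tied_y))

# Hypothetical Likert ratings (1-5) from two raters of the same summaries
rater_1 = [5, 4, 5, 5, 4, 3, 5, 4]
rater_2 = [5, 4, 4, 5, 4, 3, 5, 5]
print(round(kendall_tau_b(rater_1, rater_2), 3))
```

In practice, `scipy.stats.kendalltau` computes the same tau-b statistic along with a p-value for significance testing.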

Citation (APA)

Goldberg, M., Eisenkolb, V. M., Aftahy, A. K., Negwer, C., Meyer, H. S., Gempt, J., … Wagner, A. (2025). Validity of evidence-based recommendations by a large language model for interdisciplinary board decisions in neurooncology: An explorative study and critical evaluation. Digital Health, 11. https://doi.org/10.1177/20552076251384604
