Comparative analysis of large language models as decision support tools in oral pathology

Valentina Ignacia Alvarez-Silberberg; Victor Gil-Manich; Maria Cuevas-Nunez; Camila Paz Alvarez-Silberberg; Valeria Ramirez; Cristian Bravo Palma; Cosimo Galletti; Luca Fiorillo; Javier Flores-Fraile; Vini Mehta; Maria Teresa Fernández-Figueras

Journal Article, Open Access

Scientific Reports (2026) 16(1)

DOI: 10.1038/s41598-026-41533-z

Abstract

This study evaluated the performance of four large language model (LLM)-based chatbots (ChatGPT-4.0, ChatGPT o1-preview, Gemini, and Meta AI) as decision-support systems for interpreting histopathologic descriptions of oral lesions, assessing agreement between their outputs and pathologists' reference diagnoses. Each chatbot generated a suggested primary interpretation and three differential diagnoses. Outputs were categorized as Different, Similar, or Correct relative to the consensus reference diagnosis established by two board-certified pathologists. Statistical analyses included the Friedman test to compare model performance, Wilcoxon signed-rank tests for pairwise comparisons, Cohen's κ to assess agreement, and regression analyses to evaluate the influence of age and sex. Differential diagnosis performance was also analyzed. ChatGPT o1-preview produced the highest proportion of outputs concordant with the reference diagnosis (68.6%), followed by Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). In terms of agreement with oral pathologists, ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) showed substantial agreement, ChatGPT-4.0 showed moderate agreement (κ = 0.57), and Gemini showed poor agreement (κ = 0.24). Increasing patient age was associated with a mild but statistically significant reduction in performance for ChatGPT-4.0, Meta AI, and Gemini, whereas no significant age effect was observed for ChatGPT o1-preview; patient sex had no significant effect. Among the evaluated chatbots, ChatGPT o1-preview showed the highest alignment with oral pathologists' reference diagnoses. These findings support a potential role for LLMs as complementary decision-support tools for interpreting oral histopathology descriptions, while highlighting substantial inter-model variability and the need for cautious implementation with continued human oversight.
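To make the agreement statistic concrete, the following is a minimal sketch of how Cohen's κ can be computed between a reference diagnosis list and a chatbot's primary interpretations. It is not the authors' analysis code: the function name `cohen_kappa`, the diagnosis labels, and the six toy cases are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: fraction of cases where both labels match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over all labels used by either rater.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)

    if p_chance == 1.0:
        # Both raters used one identical label for every case; kappa is
        # undefined (0/0), so this sketch reports perfect agreement.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical toy data (NOT from the study): six cases, three diagnoses.
reference = ["fibroma", "fibroma", "lichen planus",
             "lichen planus", "carcinoma", "carcinoma"]
model = ["fibroma", "fibroma", "lichen planus",
         "carcinoma", "carcinoma", "carcinoma"]

kappa = cohen_kappa(reference, model)
print(f"kappa = {kappa:.2f}")  # kappa = 0.75
```

In this toy case the raters agree on 5 of 6 cases (observed agreement 0.83) while chance agreement is 0.33, giving κ = 0.75. κ corrects raw concordance for agreement expected by chance, which is why the abstract's concordance percentages and κ values differ.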

Alvarez-Silberberg, V. I., Gil-Manich, V., Cuevas-Nunez, M., Alvarez-Silberberg, V. I., Alvarez-Silberberg, C. P., Ramirez, V., … Cuevas-Nunez, M. (2026). Comparative analysis of large language models as decision support tools in oral pathology. Scientific Reports, 16(1). https://doi.org/10.1038/s41598-026-41533-z
