From accuracy to robustness: a comparative study of five advanced large language models on the Chinese dental licensing examination

Abstract

Background: Large language models (LLMs) have demonstrated considerable promise in various domains, including dentistry. This study aimed to evaluate five advanced LLMs (DeepSeek-R1, GPT-4o, OpenAI o3, GPT-5 Thinking, and Gemini 2.5 Pro) in the context of the Chinese Dental Licensing Examination (CDLE) to explore their potential in dental education and practice.

Methods: A total of 600 questions were selected from the official review book provided by the Chinese National Medical Examination Center. All questions, presented in Chinese, were submitted individually to the five LLMs via their web interfaces. The responses were classified as “correct” or “incorrect” using the official answer keys provided by the review book. We analyzed and compared each model’s overall accuracy and accuracy across different subjects and question types using χ² or Fisher’s exact tests, as appropriate. To assess robustness, 120 of the 600 questions were selected for adversarial testing under two types of perturbations. We employed McNemar’s test to measure each model’s accuracy degradation during adversarial testing.

Results: DeepSeek-R1, GPT-5 Thinking, Gemini 2.5 Pro, and OpenAI o3 demonstrated superior performance, significantly surpassing GPT-4o (p < 0.001), with Gemini 2.5 Pro achieving the highest accuracy at 91.67%. Performance varied across dentistry and its sub-disciplines (prosthodontics and oral anatomy), where GPT-4o significantly lagged behind the other four LLMs (p < 0.05). Gemini 2.5 Pro and GPT-5 Thinking outperformed GPT-4o on A1 and B1 question types (p < 0.05). In adversarial testing, all LLMs showed a slight decrease in accuracy, ranging from 1.66% to 5.84%, but the drop was not significant (p > 0.05).

Conclusions: Using the CDLE as a benchmark, new-generation LLMs achieved markedly higher accuracy. Furthermore, all models exhibited strong robustness against adversarial perturbations. These findings indicate that advanced LLMs hold promise as assistive tools for dental education and practice.
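
The robustness analysis compares each model's answers on the same questions before and after perturbation, which is the paired setting McNemar's test is designed for. The following is a minimal sketch of the exact (binomial) form of that test, not the authors' code: the function name `mcnemar_exact` and the counts in the example are hypothetical and chosen only for illustration, and the abstract does not say whether the exact or the chi-square form was used.

```python
from math import comb


def mcnemar_exact(original, perturbed):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    original, perturbed: equal-length sequences of booleans, one entry per
    question (True = answered correctly). Returns (b, c, p_value), where
    b counts questions correct originally but incorrect after perturbation,
    and c counts the reverse.
    """
    if len(original) != len(perturbed):
        raise ValueError("paired samples must have the same length")

    # Only discordant pairs carry information about a change in accuracy.
    b = sum(1 for o, p in zip(original, perturbed) if o and not p)
    c = sum(1 for o, p in zip(original, perturbed) if not o and p)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of change

    # Under the null hypothesis each discordant pair is equally likely to
    # fall in either direction, so min(b, c) follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical data for 120 paired questions (NOT taken from the study):
# 108 correct originally; 7 of those become incorrect after perturbation,
# and 2 originally incorrect answers become correct.
original = [True] * 108 + [False] * 12
perturbed = [False] * 7 + [True] * 101 + [True] * 2 + [False] * 10

b, c, p_value = mcnemar_exact(original, perturbed)
print(f"b={b}, c={c}, p={p_value:.4f}")
```

With these made-up counts the accuracy falls by 5 of 120 questions, yet the test does not reach the 0.05 level, which illustrates how a drop of a few percentage points on 120 paired items can be statistically non-significant.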

Citation (APA)

Zhang, J., Ubuzima, P., Huang, G., Lee, E. H., Wang, Y., Xu, H., … Wu, T. (2025). From accuracy to robustness: a comparative study of five advanced large language models on the Chinese dental licensing examination. BMC Oral Health, 25(1). https://doi.org/10.1186/s12903-025-07540-8
