From accuracy to robustness: a comparative study of five advanced large language models on the Chinese dental licensing examination

Jiarui Zhang; Pascal Ubuzima; Gaoyang Huang; En Hsuan Lee; Yuting Wang; Huifang Xu; Lei Xia; Tingting Wu

Journal Article, Open Access

BMC Oral Health (2025) 25(1)

DOI: 10.1186/s12903-025-07540-8

Abstract

Background: Large language models (LLMs) have demonstrated considerable promise in various domains, including dentistry. This study aimed to evaluate five advanced LLMs (DeepSeek-R1, GPT-4o, OpenAI o3, GPT-5 Thinking, and Gemini 2.5 Pro) in the context of the Chinese Dental Licensing Examination (CDLE) to explore their potential in dental education and practice.

Methods: A total of 600 questions were selected from the official review book provided by the Chinese National Medical Examination Center. All questions, presented in Chinese, were submitted individually to the five LLMs via their web interfaces. The responses were classified as “correct” or “incorrect” using the official answer keys provided by the review book. We analyzed and compared each model’s overall accuracy and accuracy across different subjects and question types using χ² or Fisher’s exact tests, as appropriate. To assess robustness, 120 of the 600 questions were selected for adversarial testing under two types of perturbations. We employed McNemar’s test to measure each model’s accuracy degradation during adversarial testing.

Results: DeepSeek-R1, GPT-5 Thinking, Gemini 2.5 Pro, and OpenAI o3 demonstrated superior performance, significantly surpassing GPT-4o (p < 0.001), with Gemini 2.5 Pro achieving the highest accuracy at 91.67%. Performance varied across dentistry and its sub-disciplines (prosthodontics and oral anatomy), where GPT-4o significantly lagged behind the other four LLMs (p < 0.05). Gemini 2.5 Pro and GPT-5 Thinking outperformed GPT-4o on A1 and B1 question types (p < 0.05). In adversarial testing, all LLMs showed a slight decrease in accuracy, ranging from 1.66% to 5.84%, but the drop was not significant (p > 0.05).

Conclusions: Using the CDLE as a benchmark, new-generation LLMs achieved markedly higher accuracy. Furthermore, all models exhibited strong robustness against adversarial perturbations. These findings indicate that advanced LLMs hold promise as assistive tools for dental education and practice.
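
The paired comparison described in the Methods (the same questions answered before and after perturbation) can be illustrated with a minimal sketch of McNemar's test. The abstract does not say which variant of the test was used or report the discordant-pair counts, so the exact binomial form, the function name `mcnemar_exact`, and the counts in the usage line are assumptions made only for illustration.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions answered correctly on the original but incorrectly
       on the perturbed version
    c: questions answered incorrectly on the original but correctly
       on the perturbed version
    Questions with the same outcome in both conditions do not enter the test.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of any change in accuracy.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the smaller tail for a two-sided p-value, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical counts, not taken from the study.
p_value = mcnemar_exact(6, 1)
print(f"p = {p_value:.3f}")
```

Because the test depends only on the discordant pairs, a modest net drop in accuracy on 120 questions can easily remain non-significant when few questions change outcome.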

Citation (APA)

Zhang, J., Ubuzima, P., Huang, G., Lee, E. H., Wang, Y., Xu, H., … Wu, T. (2025). From accuracy to robustness: a comparative study of five advanced large language models on the Chinese dental licensing examination. BMC Oral Health, 25(1). https://doi.org/10.1186/s12903-025-07540-8
