Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan

Ayaka Harigai; Yoshitaka Toyama; Mitsutoshi Nagano; Mirei Abe; Masahiro Kawabata; Li Li; Jin Yamamura; Kei Takase

Journal article, open access

Japanese Journal of Radiology (2025) 43(2) 319-329

DOI: 10.1007/s11604-024-01673-6

Abstract

Purpose: This study aims to investigate the effects of language selection and translation quality on Generative Pre-trained Transformer-4 (GPT-4)'s response accuracy to expert-level diagnostic radiology questions.

Materials and methods: We analyzed 146 diagnostic radiology questions from the Japan Radiology Board Examination (2020–2022), with consensus answers provided by two board-certified radiologists. The questions, originally in Japanese, were translated into English by GPT-4 and DeepL and into German and Chinese by GPT-4. Responses were generated by GPT-4 five times per question set per language. Response accuracy was compared between languages using one-way ANOVA with Bonferroni correction or the Mann–Whitney U test. Scores on selected English questions translated by a professional service and GPT-4 were also compared. The impact of translation quality on GPT-4’s performance was assessed by linear regression analysis.

Results: The median scores (interquartile range) for the 146 questions were 70 (68–72) (Japanese), 89 (84.5–95.5) (GPT-4 English), 64 (55.5–67) (Chinese), and 56 (46.5–67.5) (German). Significant differences were found between Japanese and English (p = 0.002) and between Japanese and German (p = 0.022). The counts of correct responses across five attempts for each question were significantly associated with the quality of translation into English (GPT-4, DeepL) and German (GPT-4). In a subset of 31 questions where English translations yielded fewer correct responses than Japanese originals, professionally translated questions yielded better scores than those translated by GPT-4 (13 versus 8 points, p = 0.0079).

Conclusion: GPT-4 exhibits higher accuracy when responding to English-translated questions compared to original Japanese questions, a trend not observed with German or Chinese translations. Accuracy improves with higher-quality English translations, underscoring the importance of high-quality translations in improving GPT-4’s response accuracy to diagnostic radiology questions in non-English languages and aiding non-native English speakers in obtaining accurate answers from large language models.
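
The abstract reports two per-language quantities: a score for each of the five runs (summarized as a median) and, for each question, the count of correct responses across the five attempts. The following is a minimal sketch of how such quantities could be tallied. It is not the authors' code, and the answer key, the responses, and the names `score_runs` and `correct_counts` are hypothetical toy values chosen for illustration, not study data.

```python
from statistics import median

# Hypothetical toy data: 4 questions and 5 runs (the study used 146 questions).
answer_key = ["a", "c", "b", "d"]
runs = [
    ["a", "c", "b", "d"],
    ["a", "c", "b", "a"],
    ["a", "b", "b", "d"],
    ["a", "c", "b", "d"],
    ["a", "c", "d", "a"],
]


def score_runs(responses, key):
    """Return the number of correct answers in each run."""
    return [sum(r == k for r, k in zip(run, key)) for run in responses]


def correct_counts(responses, key):
    """Return, per question, how many runs answered it correctly."""
    return [sum(run[i] == k for run in responses) for i, k in enumerate(key)]


run_scores = score_runs(runs, answer_key)            # one score per run
median_score = median(run_scores)                    # summary across runs
per_question = correct_counts(runs, answer_key)      # values from 0 to 5

print(run_scores, median_score, per_question)
```

In this sketch the per-run scores would feed a between-language comparison, and the per-question counts would serve as the outcome in a regression against a translation-quality rating; both downstream analyses are left out here because the abstract does not specify their details.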

Harigai, A., Toyama, Y., Nagano, M., Abe, M., Kawabata, M., Li, L., … Takase, K. (2025). Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan. Japanese Journal of Radiology, 43(2), 319–329. https://doi.org/10.1007/s11604-024-01673-6
