Evaluation of the accuracy and safety of machine translation of patient-specific discharge instructions: a comparative analysis

Marianna Kong; Alicia Fernandez; Jaskaran Bains; Ana Milisavljevic; Katherine C. Brooks; Akash Shanmugam; Leslie Avilez; Junhong Li; Vladyslav Honcharov; Andersen Yang; Elaine C. Khoong

Journal Article

BMJ Quality and Safety (2025)

DOI: 10.1136/bmjqs-2024-018384

16 Citations

47 Readers

Abstract

Introduction: Machine translation of patient-specific information could mitigate language barriers if sufficiently accurate and non-harmful and may be particularly useful in healthcare encounters when professional translators are not readily available. We evaluated the translation accuracy and potential for harm of ChatGPT-4 and Google Translate in translating from English to Spanish, Chinese and Russian.

Methods: We used ChatGPT-4 and Google Translate to translate 50 sets (316 sentences) of deidentified, patient-specific, clinician free-text emergency department instructions into Spanish, Chinese and Russian. These were then back-translated into English by professional translators and double-coded by physicians for accuracy and potential for clinical harm.

Results: At the sentence level, we found that both tools were ≥90% accurate in translating English to Spanish (accuracy: GPT 97%; Google Translate 96%) and English to Chinese (accuracy: GPT 95%; Google Translate 90%); neither tool performed as well in translating English to Russian (accuracy: GPT 89%; Google Translate 80%). At the instruction set level, 16%, 24% and 56% of Spanish, Chinese and Russian GPT-translated instruction sets contained at least one inaccuracy. For Google Translate, 24%, 56% and 66% of Spanish, Chinese and Russian translations contained at least one inaccuracy. The potential for harm due to inaccurate translations was ≤1% for both tools in all languages at the sentence level and ≤6% at the instruction set level. GPT was significantly more accurate than Google Translate in Chinese and Russian at the sentence level; the potential for harm was similar.

Conclusion: These results support the potential of machine translation tools to mitigate gaps in translation services for low-stakes written communication from English to Spanish, while also strengthening the case for caution and for professional oversight in non-low-risk communication. Further research is needed to evaluate machine translation for other languages and more technical content.
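The abstract reports accuracy at two levels: per sentence, and per instruction set (a set counts as inaccurate if any of its sentences is). The following is a minimal sketch of how those two rates relate, not the authors' analysis code. The function names and the toy labels are hypothetical, and the second function assumes that sentence errors are independent and that every set has the same length, neither of which the abstract states.

```python
from collections import defaultdict


def accuracy_summary(coded):
    """Summarise coded sentences.

    coded: list of (set_id, is_accurate) pairs, one per translated sentence.
    Returns (sentence-level accuracy, share of sets with >= 1 inaccuracy).
    """
    per_set = defaultdict(list)
    for set_id, is_accurate in coded:
        per_set[set_id].append(is_accurate)

    sentence_accuracy = sum(ok for _, ok in coded) / len(coded)
    sets_with_error = sum(1 for flags in per_set.values() if not all(flags))
    return sentence_accuracy, sets_with_error / len(per_set)


def expected_set_error_rate(sentence_accuracy, sentences_per_set):
    """Back-of-envelope estimate under the (assumed) independence of errors."""
    return 1 - sentence_accuracy ** sentences_per_set


# Toy labels for illustration only; these are NOT data from the study.
toy = [
    ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", True),
    ("C", True), ("C", True), ("C", True),
]
toy_sentence_accuracy, toy_set_error_rate = accuracy_summary(toy)

# Average set length implied by the abstract: 316 sentences in 50 sets.
mean_sentences_per_set = 316 / 50
rough_estimate = expected_set_error_rate(0.97, mean_sentences_per_set)
```

With a sentence-level accuracy of 97% and about 6.3 sentences per set, this rough estimate comes out near 18%, the same order as the 16% reported for GPT-translated Spanish instruction sets. It illustrates why a small per-sentence error rate can still leave a sizeable share of instruction sets with at least one inaccuracy.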

Cite (APA)

Kong, M., Fernandez, A., Bains, J., Milisavljevic, A., Brooks, K. C., Shanmugam, A., … Khoong, E. C. (2025). Evaluation of the accuracy and safety of machine translation of patient-specific discharge instructions: a comparative analysis. BMJ Quality and Safety. https://doi.org/10.1136/bmjqs-2024-018384
