Performance of Retrieval-Augmented Generation Large Language Models in Guideline-Concordant Prostate-Specific Antigen Testing: Comparative Study With Junior Clinicians


Abstract

Background: Prostate-specific antigen (PSA) testing remains the cornerstone of early prostate cancer detection. Society guidelines for prostate cancer screening via PSA testing serve to standardize patient care and are often used by trainees, junior staff, and generalist medical practitioners to guide medical decision-making. However, adhering to guidelines is a time-consuming and challenging task, and rates of inappropriate PSA testing are high. Retrieval-augmented generation (RAG) is a method to enhance the reliability of large language models (LLMs) by grounding responses in trusted external sources.

Objective: This study aimed to evaluate a RAG-enhanced LLM system, grounded in current European Association of Urology and American Urological Association guidelines, and to assess its effectiveness in providing guideline-concordant PSA screening recommendations compared to junior clinicians.

Methods: A series of 44 fictional outpatient case scenarios was developed to represent a broad spectrum of clinical presentations. A RAG pipeline was developed, comprising a life expectancy estimation module based on the Charlson Comorbidity Index, followed by LLM-generated recommendations constrained to retrieved excerpts from the European Association of Urology and American Urological Association guidelines. Five junior clinicians were tasked with providing PSA testing recommendations for the same scenarios in closed-book and open-book formats. Answers were compared for accuracy in a binary (correct/incorrect) fashion. Fleiss κ was computed to assess interrater agreement among clinicians.

Results: The RAG-LLM tool provided guideline-concordant recommendations in 95.5% (210/220) of case scenarios, compared to junior clinicians, who were correct in 62.3% (137/220) of scenarios in a closed-book format and 74.1% (163/220) of scenarios in an open-book format. The difference was statistically significant for both closed-book (P
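The interrater agreement statistic named in the Methods, Fleiss κ, has a standard closed-form computation. A minimal sketch is below; the rating counts in the example are hypothetical (the study's raw clinician ratings are not reproduced in the abstract), and the function assumes the textbook formulation with a fixed number of raters per subject.

```python
# Fleiss' kappa for interrater agreement among multiple raters.
# counts[i][j] = number of raters assigning subject i to category j.
# Assumes every subject is rated by the same number of raters.

def fleiss_kappa(counts):
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Per-subject observed agreement P_i
    p_i = [
        (sum(c ** 2 for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement P_e from the marginal category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 5 raters, 4 case scenarios, binary recommendation
ratings = [[5, 0], [4, 1], [3, 2], [5, 0]]
print(round(fleiss_kappa(ratings), 4))  # → 0.0196
```

κ ranges from below 0 (agreement worse than chance) to 1 (perfect agreement); near-unanimous marginals, as in the example, can yield a low κ even when raw agreement is high.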
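The abstract truncates before reporting exact P values for the accuracy comparison. As an illustration only, a standard pooled two-proportion z-test applied to the reported counts (RAG-LLM 210/220 vs closed-book clinicians 137/220) runs as follows; the paper may have used a different test, so this sketch should not be read as the authors' analysis.

```python
from math import sqrt, erfc

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided P value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Reported counts from the study: 95.5% (210/220) vs 62.3% (137/220)
z, p = two_proportion_z(210, 220, 137, 220)
print(f"z = {z:.2f}, P = {p:.1e}")
```

With these counts the z statistic exceeds 8, so the difference is significant at any conventional threshold, consistent with the abstract's statement.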

APA

Tung, J. Y. M., Le, Q., Yao, J., Huang, Y., Lim, D. Y. Z., Sng, G. G. R., … Ho, H. S. S. (2025). Performance of Retrieval-Augmented Generation Large Language Models in Guideline-Concordant Prostate-Specific Antigen Testing: Comparative Study With Junior Clinicians. Journal of Medical Internet Research, 27. https://doi.org/10.2196/78393
