Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings

Kar Mun C. Woo; Gregory W. Simon; Olumide Akindutire; Yindalon Aphinyanaphongs; Jonathan S. Austrian; Jung G. Kim; Nicholas Genes; Jacob A. Goldenring; Vincent J. Major; Chloé S. Pariente; Edwin G. Pineda; Stella K. Kang

Journal Article (Open Access)

Journal of the American Medical Informatics Association (2024) 31(9) 1983-1993

DOI: 10.1093/jamia/ocae117

Abstract

Objectives: To evaluate the proficiency of a HIPAA-compliant version of GPT-4 in identifying actionable, incidental findings from unstructured radiology reports of Emergency Department patients. To assess appropriateness of artificial intelligence (AI)-generated, patient-facing summaries of these findings.

Materials and Methods: Radiology reports extracted from the electronic health record of a large academic medical center were manually reviewed to identify non-emergent, incidental findings with high likelihood of requiring follow-up, further sub-stratified as “definitely actionable” (DA) or “possibly actionable—clinical correlation” (PA-CC). Instruction prompts to GPT-4 were developed and iteratively optimized using a validation set of 50 reports. The optimized prompt was then applied to a test set of 430 unseen reports. GPT-4 performance was primarily graded on accuracy identifying either DA or PA-CC findings, then secondarily for DA findings alone. Outputs were reviewed for hallucinations. AI-generated patient-facing summaries were assessed for appropriateness via Likert scale.

Results: For the primary outcome (DA or PA-CC), GPT-4 achieved 99.3% recall, 73.6% precision, and 84.5% F-1. For the secondary outcome (DA only), GPT-4 demonstrated 95.2% recall, 77.3% precision, and 85.3% F-1. No findings were “hallucinated” outright. However, 2.8% of cases included generated text about recommendations that were inferred without specific reference. The majority of True Positive AI-generated summaries required no or minor revision.

Conclusion: GPT-4 demonstrates proficiency in detecting actionable, incidental findings after refined instruction prompting. AI-generated patient instructions were most often appropriate, but rarely included inferred recommendations. While this technology shows promise to augment diagnostics, active clinician oversight via “human-in-the-loop” workflows remains critical for clinical implementation.
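
As a reader's aid (not part of the published abstract), the reported F-1 values can be checked against the reported precision and recall, assuming F-1 here is the standard harmonic mean of the two. The function name `f1_score` and the variable names are illustrative choices; only the percentages come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages reported in the abstract, converted to fractions.
primary = f1_score(precision=0.736, recall=0.993)    # DA or PA-CC findings
secondary = f1_score(precision=0.773, recall=0.952)  # DA findings only

print(f"primary F-1:   {primary:.1%}")    # 84.5%
print(f"secondary F-1: {secondary:.1%}")  # 85.3%
```

Under that assumption, both recomputed values agree with the figures stated in the Results to one decimal place.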

Citation (APA)

Woo, K. M. C., Simon, G. W., Akindutire, O., Aphinyanaphongs, Y., Austrian, J. S., Kim, J. G., … Kang, S. K. (2024). Evaluation of GPT-4 ability to identify and generate patient instructions for actionable incidental radiology findings. Journal of the American Medical Informatics Association, 31(9), 1983–1993. https://doi.org/10.1093/jamia/ocae117
