Purpose: Scoring postencounter patient notes (PNs) yields significant insights into student performance, but the resource intensity of scoring limits its use. Recent advances in natural language processing (NLP) and machine learning allow application of automated short answer grading (ASAG) to this task. This retrospective study evaluated the psychometric characteristics and reliability of an ASAG system for PNs, along with factors contributing to implementation, including feasibility and the case-specific phrase annotation required to tune the system for a new case.

Method: PNs from standardized patient (SP) cases within a graduation competency exam were used to train the ASAG system, which applies a feed-forward neural network algorithm for scoring. Using faculty phrase-level annotation, 10 PNs per case were required to tune the ASAG system. After tuning, ASAG item-level ratings for 20 notes were compared across ASAG-faculty (4 cases, 80 pairings) and ASAG-nonfaculty (2 cases, 40 pairings) rater pairs. Psychometric characteristics were examined using item analysis and Cronbach's alpha. Inter-rater reliability (IRR) was examined using kappa.

Results: ASAG scores demonstrated sufficient variability to differentiate learner PN performance and high IRR between machine and human ratings. Across all items, the mean ASAG-faculty kappa was .83 (SE ±.02); the ASAG-nonfaculty kappa was also .83 (SE ±.02). ASAG scoring demonstrated high item discrimination. Internal consistency reliability at the case level ranged from a Cronbach's alpha of .65 to .77. The faculty time cost to train and supervise nonfaculty raters for 4 cases was approximately $1,856; the faculty cost to tune the ASAG system was approximately $928.

Conclusions: NLP-based automated scoring of PNs demonstrated a high degree of reliability and psychometric confidence for use as learner feedback. The small number of phrase-level annotations required to tune the system to a new case enhances feasibility. ASAG-enabled PN scoring has broad implications for improving feedback in case-based learning contexts in medical education.
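The two reliability statistics reported above, kappa for machine-human agreement on item-level ratings and Cronbach's alpha for internal consistency, can be sketched as below. This is an illustrative computation only, not the authors' implementation; the rating vectors and item data are hypothetical, and the study's exact kappa variant and software are not specified in the abstract.

```python
# Illustrative sketch of the reliability statistics named in the abstract.
# All data below are hypothetical; they do not reproduce the study's results.

def cohen_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(r1)
    cats = set(r1) | set(r2)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n)        # chance agreement
              for c in cats)
    return (p_o - p_e) / (1 - p_e)

def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of per-item score lists
    (one inner list per checklist item, one entry per examinee)."""
    k = len(items)
    n = len(items[0])
    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Hypothetical item-level ratings (1 = phrase credited, 0 = not credited)
machine = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
faculty = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(round(cohen_kappa(machine, faculty), 2))  # → 0.78

# Hypothetical two-item, four-examinee score matrix
print(round(cronbach_alpha([[1, 0, 1, 1], [1, 0, 0, 1]]), 2))  # → 0.73
```

In practice such statistics would typically come from a standard package (e.g., a statistics library's kappa and alpha routines) rather than hand-rolled code; the sketch only makes the arithmetic behind the reported .83 kappa and .65–.77 alpha range concrete.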
Citation:
Bond, W. F., Zhou, J., Bhat, S., Park, Y. S., Ebert-Allen, R. A., Ruger, R. L., & Yudkowsky, R. (2023). Automated Patient Note Grading: Examining Scoring Reliability and Feasibility. Academic Medicine, 98(11), S90–S97. https://doi.org/10.1097/ACM.0000000000005357