Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation


Abstract

In recent years, machine learning models have rapidly become better at generating clinical consultation notes, yet there is little work on how to properly evaluate these generated notes and understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this, we present an extensive human evaluation study of consultation notes in which 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study between 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore. All our findings and annotations are open-sourced.
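
As a concrete illustration of the kind of comparison the abstract describes, the sketch below (not the authors' released code; the function names and toy data are assumptions) computes a character-based Levenshtein similarity between clinician-written and generated notes, then correlates the metric scores with human judgements using Spearman's rank correlation via scipy.stats.spearmanr.

```python
from scipy.stats import spearmanr


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(reference: str, hypothesis: str) -> float:
    """Normalise the distance to [0, 1]; higher means more similar."""
    longest = max(len(reference), len(hypothesis))
    return 1.0 if longest == 0 else 1.0 - levenshtein(reference, hypothesis) / longest


# Toy data (illustrative only): clinician-written reference notes,
# model-generated notes, and one human quality judgement per note.
references = [
    "Patient reports a dry cough for three days. No fever.",
    "Complains of headache; advised paracetamol and rest.",
    "Follow-up in two weeks if symptoms persist.",
]
generated = [
    "Patient reports a dry cough for three days. No fever.",
    "Complains of headaches, advised rest.",
    "Book a blood test for tomorrow.",
]
human_scores = [1.0, 0.6, 0.1]

metric_scores = [levenshtein_similarity(r, g) for r, g in zip(references, generated)]
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

The appeal of such a metric, in line with the paper's finding, is that it requires no trained model: it is a pure string comparison, yet it can track human judgements about as well as model-based metrics like BertScore.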

Cite (APA)

Moramarco, F., Korfiatis, A. P., Perera, M., Juric, D., Flann, J., Reiter, E., … Savkov, A. (2022). Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 5739–5754). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.394
