Text simplification tools: Using machine learning to discover features that identify difficult text

Abstract

Although providing understandable information is a critical part of healthcare, few tools exist to help clinicians identify difficult sections of text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three of these features are new and have not been examined before: medical concept density; specificity (calculated using word-level depth in MeSH); and ambiguity (calculated using the number of UMLS Metathesaurus concepts associated with a word). We evaluate the features on a binary prediction task over 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, the random forest classifier is the most accurate, at 84%. Analysis of the six models and a complementary ablation study show that the specificity and ambiguity features are the strongest predictors (24% combined impact on accuracy). Notably, a training-size study showed that an accuracy of 80% can be achieved even with a 1% sample (1,062 sentences). © 2014 IEEE.
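To make the three new features concrete, the following is a minimal sketch of how they might be computed for a single sentence. It is not the authors' implementation. The lookup tables `MESH_DEPTH` and `UMLS_CONCEPT_COUNT` are hypothetical toy stand-ins for MeSH and the UMLS Metathesaurus, the values in them are invented for illustration, and averaging over matched tokens is an assumption, since the abstract does not state how word-level values are aggregated.

```python
# Hypothetical toy lookups; a real system would query MeSH and the UMLS
# Metathesaurus. All numbers below are invented for illustration only.
MESH_DEPTH = {"disease": 1, "heart": 2, "cardiomyopathy": 4}
UMLS_CONCEPT_COUNT = {"disease": 2, "heart": 3, "cardiomyopathy": 1, "cold": 6}


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?()\"'").lower() for word in sentence.split())
    return [t for t in tokens if t]


def sentence_features(sentence):
    """Return assumed versions of the three features for one sentence."""
    tokens = tokenize(sentence)
    if not tokens:
        return {"concept_density": 0.0, "specificity": 0.0, "ambiguity": 0.0}

    depths = [MESH_DEPTH[t] for t in tokens if t in MESH_DEPTH]
    counts = [UMLS_CONCEPT_COUNT[t] for t in tokens if t in UMLS_CONCEPT_COUNT]

    return {
        # Assumption: share of tokens that map to at least one concept.
        "concept_density": len(counts) / len(tokens),
        # Assumption: mean hierarchy depth over tokens found in the depth table.
        "specificity": sum(depths) / len(depths) if depths else 0.0,
        # Assumption: mean number of concepts over tokens found in the count table.
        "ambiguity": sum(counts) / len(counts) if counts else 0.0,
    }


if __name__ == "__main__":
    print(sentence_features("Cardiomyopathy is a heart disease."))
```

In a setup like the one the abstract describes, a feature dictionary of this kind would be computed for every sentence and combined with the other features before training a binary classifier.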

Kauchak, D., Mouradi, O., Pentoney, C., & Leroy, G. (2014). Text simplification tools: Using machine learning to discover features that identify difficult text. In Proceedings of the Annual Hawaii International Conference on System Sciences (pp. 2616–2625). IEEE Computer Society. https://doi.org/10.1109/HICSS.2014.330
