Abstract
Objective: Automated understanding of consumer health inquiries might be hindered by misspellings. To detect and correct various types of spelling errors in consumer health questions, we developed a distributable spell-checking tool, CSpell, that handles nonword errors, real-word errors, word boundary infractions, punctuation errors, and combinations of these.
Methods: We developed a novel approach that uses dual embedding within Word2vec for context-dependent corrections. This technique was used in combination with dictionary-based corrections in a 2-stage ranking system. We also developed various splitters and handlers to correct word boundary infractions. All correction approaches are integrated to handle errors in consumer health questions.
Results: Our approach achieves F1 scores of 80.93% and 69.17% for spelling error detection and correction, respectively.
Discussion: The dual-embedding model shows a significant improvement (9.13%) in F1 score compared with the general practice of using cosine similarity with word vectors in Word2vec for context ranking. Our 2-stage ranking system shows a 4.94% improvement in F1 score compared with the best 1-stage ranking system.
Conclusion: CSpell improves over the state of the art and provides near real-time automatic misspelling detection and correction in consumer health questions. The software and the CSpell test set are available at https://umlslex.nlm.nih.gov/cSpell.
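As a rough illustration of the context-ranking idea named in the Methods, the sketch below scores each correction candidate by averaging the input-side vectors of the surrounding words and taking a dot product with the candidate's output-side vector. This reading of "dual embedding" is an assumption, not CSpell's actual implementation, and the vectors, the function names (`context_vector`, `dual_embedding_score`, `rank_candidates`), and the example words are all hypothetical.

```python
# Toy, hand-made vectors (hypothetical values, not from a trained model).
# INPUT_VECS stands in for Word2vec's input-side embeddings (context words);
# OUTPUT_VECS stands in for its output-side embeddings (predicted words).
INPUT_VECS = {
    "my": [0.1, 0.0, 0.2],
    "blood": [0.9, 0.1, 0.0],
    "is": [0.0, 0.1, 0.1],
    "high": [0.5, 0.2, 0.1],
}
OUTPUT_VECS = {
    "pressure": [1.0, 0.2, 0.0],
    "pleasure": [0.0, 0.1, 1.0],
}


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_vector(context, input_vecs):
    """Average the input-side vectors of the context words that are known."""
    known = [input_vecs[w] for w in context if w in input_vecs]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def dual_embedding_score(candidate, context, input_vecs, output_vecs):
    """Score a candidate: averaged context (input side) dotted with the
    candidate's output-side vector. Unknown words score negative infinity."""
    hidden = context_vector(context, input_vecs)
    if hidden is None or candidate not in output_vecs:
        return float("-inf")
    return dot(hidden, output_vecs[candidate])


def rank_candidates(candidates, context):
    """Return the candidates sorted from best to worst context score."""
    return sorted(
        candidates,
        key=lambda c: dual_embedding_score(c, context, INPUT_VECS, OUTPUT_VECS),
        reverse=True,
    )


if __name__ == "__main__":
    context = ["my", "blood", "is", "high"]
    print(rank_candidates(["pleasure", "pressure"], context))
```

In a 2-stage arrangement of the kind the abstract describes, a score like this would be one of the rankers, used alongside dictionary-based ranking; how the stages are ordered and combined in CSpell is not specified here.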
Cite
Lu, C. J., Aronson, A. R., Shooshan, S. E., & Demner-Fushman, D. (2019). Spell checker for consumer language (CSpell). Journal of the American Medical Informatics Association, 26(3), 211–218. https://doi.org/10.1093/jamia/ocy171