Abstract

Objective: To characterize text and sublanguage in medical records to better address challenges within Natural Language Processing (NLP) tasks such as information extraction, word sense disambiguation, information retrieval, and text summarization. The text and sublanguage analysis is needed to scale up the NLP development for large and diverse free-text clinical data sets.

Design: This is a quantitative descriptive study which analyzes the text and sublanguage characteristics of a very large Veteran Affairs (VA) clinical note corpus (569 million notes) to guide the customization of natural language processing (NLP) of VA notes.

Methods: We randomly sampled 100,000 notes from the top 100 most frequently appearing document types. We examined surface features and used those features to identify sublanguage groups using unsupervised clustering.

Results: Using the text features we are able to characterize each of the 100 document types and identify 16 distinct sublanguage groups. The identified sublanguages reflect different clinical domains and types of encounters within the sample corpus. We also found much variance within each of the document types. Such characteristics will facilitate the tuning and crafting of NLP tools.

Conclusion: Using a diverse and large sample of clinical text, we were able to show there are a relatively large number of sublanguages and variance both within and between document types. These findings will guide NLP development to create more customizable and generalizable solutions across medical domains and sublanguages.
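The approach summarized under Methods (surface features followed by unsupervised clustering) can be illustrated with a minimal sketch. The abstract does not name the specific features or the clustering algorithm, so everything below is an assumption made for illustration: the four features, the z-score scaling, the plain k-means loop, and the toy notes are all hypothetical and are not the study's pipeline or data.

```python
import math
import re


def surface_features(note: str) -> list[float]:
    """Compute a few illustrative surface features for one note (assumed set)."""
    tokens = re.findall(r"\S+", note)
    words = [t for t in tokens if t.isalpha()]
    n_tokens = max(len(tokens), 1)
    return [
        float(len(tokens)),                                # note length in tokens
        sum(len(w) for w in words) / max(len(words), 1),   # mean length of alphabetic tokens
        sum(any(c.isdigit() for c in t) for t in tokens) / n_tokens,  # share of tokens with digits
        sum(t.isupper() for t in tokens) / n_tokens,       # share of all-caps tokens
    ]


def standardize(rows: list[list[float]]) -> list[list[float]]:
    """Z-score each feature column so no single feature dominates distances."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points: list[list[float]], k: int, iterations: int = 20) -> list[int]:
    """Plain k-means with a deterministic start (the first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy notes invented for this sketch: two narrative, two measurement-heavy.
notes = [
    "Patient reports mild chest pain after walking. Denies fever or cough.",
    "BP 120/80 HR 72 TEMP 98.6 WT 180 LBS",
    "Patient feels well today and reports no new complaints since last visit.",
    "NA 140 K 4.1 CL 102 CO2 24 BUN 15",
]
labels = kmeans(standardize([surface_features(n) for n in notes]), k=2)
print(labels)
```

In this toy setting the narrative notes and the measurement-heavy notes fall into separate clusters. A study at the scale described would need a richer feature set and a principled way to choose the number of clusters; the sketch only shows the general shape of the technique.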
CITATION STYLE
T. Zeng, Q. (2013). Characterizing Clinical Text and Sublanguage: A Case Study of the VA Clinical Notes. Journal of Health & Medical Informatics, 04(02). https://doi.org/10.4172/2157-7420.s3-001