Background Secondary data often contain unstructured free text. The aim of this study was to validate a text mining system for extracting unstructured medical data for research purposes.

Methods From a radiological department, 1,000 out of 7,102 CT findings were randomly selected. These were manually divided into defined groups by 2 physicians. For automated tagging and reporting, the text analysis software Averbis Extraction Platform (AEP) was used. Special features of the system are a morphological analysis for the decomposition of compound words as well as the recognition of noun phrases, abbreviations and negated statements. Based on the extracted standardized keywords, findings reports were assigned to the given findings groups using machine learning methods. To assess the reliability and validity of the automated process, the automated and two independent manual mappings were compared for matches in multiple runs.

Results Manual classification was too time-consuming. In the case of automated keywording, classification according to ICD-10 turned out to be unsuitable for our data, and keyword search alone did not deliver reliable results. Computer-aided text mining combined with machine learning produced reliable results. Inter-rater reliability was very high, both between the two manual classifications and between the machine and manual classifications. The two manual classifications were consistent in 93 % of all findings (kappa coefficient 0.89 [95 % confidence interval (CI) 0.87-0.92]). The automatic classification agreed with the independent, second manual classification in 86 % of all findings (kappa coefficient 0.79 [95 % CI 0.75-0.81]).

Discussion The classification performed by the AEP software was very good. In our study, however, its errors followed a systematic pattern: most misclassifications were found in findings that indicate an increased risk of cancer. The free-text structure of the findings raises concerns about the feasibility of a purely automated analysis. The combination of human intellect and intelligent, adaptive software appears most suitable for mining unstructured but important textual information for research.
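The two agreement measures reported in the Results, percent agreement and the kappa coefficient, can be illustrated with a minimal Python sketch. It assumes the kappa reported is Cohen's kappa for two raters, which the abstract does not state explicitly. The function names, the label names and the ten example findings are hypothetical and are not taken from the study, and the sketch does not compute the confidence intervals.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of findings to which both raters assigned the same group."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same, non-empty set of findings")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both raters pick the same group
    # if each assigned groups independently at their own marginal rates.
    groups = set(counts_a) | set(counts_b)
    p_chance = sum(counts_a[g] * counts_b[g] for g in groups) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical group")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical group assignments for ten findings (illustrative only).
rater_1 = ["normal", "normal", "tumor", "fracture", "tumor",
           "normal", "fracture", "normal", "tumor", "normal"]
rater_2 = ["normal", "normal", "tumor", "fracture", "normal",
           "normal", "fracture", "normal", "tumor", "tumor"]

print(f"agreement: {percent_agreement(rater_1, rater_2):.2f}")
print(f"kappa:     {cohen_kappa(rater_1, rater_2):.2f}")
```

Because kappa subtracts the agreement expected by chance, it is always lower than the raw percent agreement whenever the raters disagree at all, which matches the pattern in the reported figures (93 % versus 0.89, and 86 % versus 0.79).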
CITATION STYLE
Pokora, R. M., Le Cornet, L., Daumke, P., Mildenberger, P., Zeeb, H., & Blettner, M. (2020). Validation of Semantic Analyses of Unstructured Medical Data for Research Purposes. Gesundheitswesen, Supplement, 82, S158–S164. https://doi.org/10.1055/a-1007-8540