Clustering large amounts of unstructured data is an important challenge in contemporary medicine and biology. This article presents an automatic clustering method for unstructured medical data. The method consists of four main steps: (1) transforming the document corpus into a term-frequency matrix; (2) reducing the dimensionality of that matrix using principal component analysis (PCA); (3) directly comparing pairs of documents using similarity measures based on the cosine and correlation distances; and (4) finding the optimal number of groups for expertly labelled data sets by treating clustering as an optimization problem, in which the objective function is the F-measure and the optimized parameters include the PCA resolution and the similarity threshold for document pairs. The usefulness of the proposed methodology is demonstrated on three data sets: short sentences divided into three themes, radiological reports of aneurysms, and radiological reports of abdominal studies. A common barrier in clustering unstructured data is the difficulty of interpreting the results. To overcome this limitation, several presentation methods are described: group histograms, similarity matrices, plots of document assignment to the found clusters, F-measure interpolation, and alphabetical and term-frequency dictionaries. Excluding the labelling step, the method is completely automated and can be used for preliminary analysis of large bodies of text to discover potential groups of interesting topics.
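The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes whitespace tokenization, groups documents by linking every pair whose cosine similarity reaches the threshold (connected components), and scores a grouping with a pairwise F-measure against expert labels. The PCA step and the correlation distance are omitted to keep the example dependency-free. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def term_frequency_matrix(docs):
    """Return (vocabulary, rows), where rows[i][k] counts term k in document i."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_by_threshold(rows, threshold):
    """Assumed grouping rule: link pairs with similarity >= threshold, then
    label each connected component with an integer 0, 1, 2, ..."""
    n = len(rows)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if cosine_similarity(rows[i], rows[j]) >= threshold:
            parent[find(i)] = find(j)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


def pairwise_f_measure(predicted, expert):
    """F-measure over document pairs: a pair is positive if both share a group."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_expert = expert[i] == expert[j]
        if same_predicted and same_expert:
            tp += 1
        elif same_predicted:
            fp += 1
        elif same_expert:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(rows, expert, candidates):
    """Grid search: return (F-measure, threshold) with the highest F-measure."""
    return max(
        (pairwise_f_measure(group_by_threshold(rows, t), expert), t)
        for t in candidates
    )


# Toy corpus (hypothetical): two aneurysm-like and two abdomen-like reports.
docs = [
    "aneurysm of the aorta",
    "aortic aneurysm aorta dilated",
    "liver lesion in abdomen",
    "abdomen liver cyst",
]
expert_labels = [0, 0, 1, 1]

vocab, rows = term_frequency_matrix(docs)
labels = group_by_threshold(rows, 0.4)
score, threshold = best_threshold(rows, expert_labels, [0.0, 0.4, 0.55])
print(labels, score, threshold)
```

In a fuller version, the search in `best_threshold` would run over two parameters rather than one, adding the number of retained principal components alongside the similarity threshold.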
CITATION STYLE
Wilczek, S., Gawrysiak, K., & Spinczyk, D. (2019). Similarity search for the content of medical records using unstructured data. In Advances in Intelligent Systems and Computing (Vol. 762, pp. 506–517). Springer Verlag. https://doi.org/10.1007/978-3-319-91211-0_44