Metadata discovery is the process of recognizing the semantics and descriptors of data elements and datasets. This study uses a machine-learning approach to classify biomedical dataset characteristics for metadata discovery. Four common types of data sources were included: genetic variant data, protein structure data, scientific publications, and a general English corpus. Decision tree classification models were built using token-based features derived from these data files. The models identified the four data sources with average F1 scores ranging from 0.935 to 1.000. This study demonstrates that biomedical data of different types have different distributions of token-based document structural features, and that these structural features can be leveraged for metadata discovery.
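As a rough illustration of what "token-based features" might look like, the sketch below computes a few per-document token statistics using only the Python standard library. The whitespace tokenizer, the feature names (`numeric`, `upper`, `alpha`, `mean_len`), and the two sample lines are assumptions made for this example; the abstract does not specify the feature set or tokenization the authors used.

```python
import re
from typing import Dict

# Matches integer or decimal tokens such as "50", "-3", "0.25".
NUMERIC_RE = re.compile(r"[-+]?\d+(\.\d+)?")


def token_features(text: str) -> Dict[str, float]:
    """Compute simple token-level structural features for one document.

    Hypothetical feature set for illustration only; not the authors' features.
    """
    tokens = text.split()  # whitespace tokenization (an assumption)
    n = len(tokens)
    if n == 0:
        return {"numeric": 0.0, "upper": 0.0, "alpha": 0.0, "mean_len": 0.0}
    numeric = sum(1 for t in tokens if NUMERIC_RE.fullmatch(t))
    upper = sum(1 for t in tokens if t.isalpha() and t.isupper())
    alpha = sum(1 for t in tokens if t.isalpha())
    return {
        "numeric": numeric / n,                        # fraction of numeric tokens
        "upper": upper / n,                            # fraction of all-uppercase words
        "alpha": alpha / n,                            # fraction of purely alphabetic tokens
        "mean_len": sum(len(t) for t in tokens) / n,   # mean token length
    }


# Made-up sample lines for this sketch, not data from the study.
variant_line = "chr1 12345 rs0001 A G 50 PASS"
prose_line = "The protein binds to the receptor"

variant_feats = token_features(variant_line)
prose_feats = token_features(prose_line)
```

On these two samples the variant-style line has a higher share of numeric and all-uppercase tokens, while the prose line is entirely alphabetic. Feature vectors of this kind could then be passed to a decision tree classifier, for example scikit-learn's `DecisionTreeClassifier`, although the abstract does not name the software used.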
CITATION STYLE
Wen, J., Gouripeddi, R., & Facelli, J. C. (2017). Metadata discovery of heterogeneous biomedical datasets using token-based features. In Lecture Notes in Electrical Engineering (Vol. 449, pp. 60–67). Springer Verlag. https://doi.org/10.1007/978-981-10-6451-7_8