Metadata discovery of heterogeneous biomedical datasets using token-based features

Jingran Wen; Ramkiran Gouripeddi; Julio C. Facelli

Conference Proceedings

Lecture Notes in Electrical Engineering (2017) 449 60-67

DOI: 10.1007/978-981-10-6451-7_8

0 Citations

8 Readers

Abstract

Metadata discovery is the process of recognizing the semantics and descriptors of data elements and datasets. This study uses a machine-learning approach to classify biomedical dataset characteristics for metadata discovery. Four common types of biomedical data sources were included in this study: genetic variants, protein structures, scientific publications, and a general English corpus. Decision tree classification models were built using token-based features derived from these data files. The models identify the four data sources with average F1 scores ranging from 0.935 to 1.000. This study demonstrates that biomedical data of different types have different distributions of token-based document structural features and that such structural features can be leveraged for metadata discovery.
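The abstract does not list the token-based features that were used. As a rough illustration only, the sketch below assumes the features are the fractions of whitespace-delimited tokens that fall into a few simple classes (numeric, alphabetic, all-uppercase). The class names, the regular expressions, and the `token_features` helper are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical token classes; the paper's actual feature set is not given
# in the abstract, so these patterns are assumptions for illustration.
TOKEN_CLASSES = {
    "numeric": re.compile(r"^[+-]?\d+(\.\d+)?$"),
    "alpha": re.compile(r"^[A-Za-z]+$"),
    "upper": re.compile(r"^[A-Z]+$"),
}


def token_features(text):
    """Return the fraction of whitespace-delimited tokens in each class."""
    tokens = text.split()
    if not tokens:
        return {name: 0.0 for name in TOKEN_CLASSES}
    counts = Counter()
    for tok in tokens:
        # A token may belong to more than one class (e.g. "A" is both
        # alphabetic and all-uppercase).
        for name, pattern in TOKEN_CLASSES.items():
            if pattern.match(tok):
                counts[name] += 1
    return {name: counts[name] / len(tokens) for name in TOKEN_CLASSES}


# Made-up example lines: one resembling a variant record, one plain prose.
variant_like = "chr1 12345 A G 0.25"
prose = "The protein binds to the receptor"

print(token_features(variant_like))
print(token_features(prose))
```

Feature vectors of this kind could then be passed to a decision tree learner (for example, scikit-learn's `DecisionTreeClassifier`) with the data source type as the label. Whether the study used that library or these particular features is not stated in the abstract.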

Cite

Wen, J., Gouripeddi, R., & Facelli, J. C. (2017). Metadata discovery of heterogeneous biomedical datasets using token-based features. In Lecture Notes in Electrical Engineering (Vol. 449, pp. 60–67). Springer Verlag. https://doi.org/10.1007/978-981-10-6451-7_8