Abstract
Background: Medical terms require specialized expertise and are becoming increasingly complex, which makes it difficult to apply natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models by the semantic similarity and relatedness of medical word pairs. However, very few reference standards exist in languages other than English. In addition, because the existing reference standards were developed a long time ago, an updated standard is needed to reflect recent findings in the medical sciences.
Objective: We propose a new Korean word pair reference set for verifying embedding models.
Methods: From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news articles, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) verified the resulting set of 607 word pairs.
Results: The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness of the word pairs showed a high correlation (ρ=0.70, P
Yum, Y., Lee, J. M., Jang, M. J., Kim, Y., Kim, J. H., Kim, S., … Joo, H. J. (2021). A word pair dataset for semantic similarity and relatedness in Korean medical vocabulary: Reference development and validation. JMIR Medical Informatics, 9(6). https://doi.org/10.2196/29667