In a user-generated text such as on social media platforms and online forums, people often use disease or symptom terms in ways other than to describe their health. In data-driven public health surveillance, the health mention classification (HMC) task aims to identify posts where users are discussing health conditions rather than using disease and symptom terms for other reasons. Existing computational research typically only studies health mentions in Twitter, with limited coverage of disease or symptom terms, ignore user behavior information, and other ways people use disease or symptom terms. To advance the HMC research, we present a Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for the HMC. RHMD consists of 10,015 manually labeled Reddit posts that mention 15 common disease or symptom terms and are annotated with four labels: namely personal health mentions, non-personal health mentions, figurative health mentions, and hyperbolic health mentions. With RHMD, we propose HMCNET that combines a target keyword (disease or symptom term) identification and user behavior hierarchically to improve HMC. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods with an F1-Score of 0.75 (an increase of 11% over the state-of-the-art) and shows that our new dataset poses a strong challenge to the existing HMC methods.
CITATION STYLE
Naseem, U., Kim, J., Khushi, M., & Dunn, A. G. (2022). Identification of Disease or Symptom terms in Reddit to Improve Health Mention Classification. In WWW 2022 - Proceedings of the ACM Web Conference 2022 (pp. 2573–2581). Association for Computing Machinery, Inc. https://doi.org/10.1145/3485447.3512129
Mendeley helps you to discover research relevant for your work.