PHDD: Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic

Abstract

Health-related information is considered ‘highly sensitive’ by the European General Data Protection Regulation (GDPR), and determining whether a text document contains health-related information is of interest to both individuals and companies in a number of scenarios. Although some efforts have been made to detect different categories of personal data in text, including health information, automatic classification remains challenging. In this work, we aim to contribute to solving this challenge by building a corpus of tweets shared during the COVID-19 pandemic. The corpus, called PHDD (Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic), contains 1,494 tweets that were manually tagged by three taggers along three dimensions: health-sensitivity status, categories of health information, and subject of health history. Furthermore, a lightweight ontology called PTHI (Privacy Tags for Health Information), which reuses two other vocabularies, namely hl7 and dpv, was built to represent the corpus in a machine-readable format. The corpus is publicly available and can be used by NLP experts to implement techniques for detecting sensitive health information in textual documents.

Cite

Saniei, R., & Rodríguez Doncel, V. (2022). PHDD: Corpus of Physical Health Data Disclosure on Twitter During COVID-19 Pandemic. SN Computer Science, 3(3). https://doi.org/10.1007/s42979-022-01097-x
