Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts

Abstract

We present a novel approach that incorporates transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts from the COVID-19 subreddits of specific US states. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data is unreliable. Subsequently, in a time-series forecasting task, we fully utilise the predictive power of the caseload and compare the relative strengths of using different supplementary datasets as covariate feature sets in a transformer-based time-series model.
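
As a rough illustration of the idea in the abstract, the sketch below turns per-post cluster assignments into daily features and builds upward-trend labels from a caseload series. It is not the authors' pipeline: it assumes each post has already been embedded and assigned a cluster label by some density-based clustering step (with -1 marking unclustered noise), and the function names, the noise convention, and the relative-rise definition of an "upward trend" are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def daily_cluster_shares(posts, cluster_ids):
    """Compute, for each day, the share of posts falling in each cluster.

    posts: iterable of (day, cluster_id) pairs. A cluster_id of -1 is
           treated as noise (an assumed convention, not from the paper).
    cluster_ids: the clusters to track, in the order of the output vector.
    Returns {day: [share for each cluster in cluster_ids]}.
    """
    counts = defaultdict(Counter)
    for day, cid in posts:
        counts[day][cid] += 1

    features = {}
    for day in sorted(counts):
        # Noise posts are not tracked but still count toward the total.
        total = sum(counts[day].values())
        features[day] = [counts[day][c] / total for c in cluster_ids]
    return features


def upward_trend_labels(cases, horizon, threshold):
    """Label day t as 1 if cases rise by more than `threshold` (relative)
    between t and t + horizon, else 0. Hypothetical label definition."""
    labels = []
    for t in range(len(cases) - horizon):
        base = cases[t]
        rise = (cases[t + horizon] - base) / base if base > 0 else 0.0
        labels.append(1 if rise > threshold else 0)
    return labels
```

The resulting daily feature vectors could then be paired with the labels and passed to any classifier; which classifier and which threshold the paper uses is not stated in this abstract.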

Drinkall, F., Zohren, S., & Pierrehumbert, J. B. (2022). Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts. In NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 1471–1484). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.naacl-main.105
