We present a novel approach that incorporates transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts in the COVID-19 subreddits of specific US states. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data are unreliable. Subsequently, in a time-series forecasting task, we fully utilise the predictive power of the caseload and compare the relative strengths of different supplementary datasets when used as covariate feature sets in a transformer-based time-series model.
Drinkall, F., Zohren, S., & Pierrehumbert, J. B. (2022). Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts. In NAACL 2022 - 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 1471–1484). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.naacl-main.105