Finding the needle in a haystack: Extraction of Informative COVID-19 Danish Tweets

Benjamin Olsen; Barbara Plank

Conference Proceedings, Open Access

W-NUT 2021 - 7th Workshop on Noisy User-Generated Text, Proceedings of the Conference (2021) 11-19

DOI: 10.18653/v1/2021.wnut-1.2

Abstract

Finding informative COVID-19 posts in a stream of tweets is useful for monitoring health-related updates. Prior work focused on a balanced data setup and on English, but informative tweets are rare, and English is only one of the many languages spoken in the world. In this work, we introduce a new dataset of 5,000 tweets for identifying informative COVID-19 tweets in Danish. In contrast to prior work, which balances the label distribution, we model the problem by keeping its natural distribution. We examine how well a simple probabilistic model and a convolutional neural network (CNN) perform on this task. We find that a weighted CNN works well, but it is sensitive to embedding and hyperparameter choices. We hope the contributed dataset is a starting point for further work in this direction.
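The abstract does not say how the CNN is weighted. One common reading is a class-weighted loss, in which errors on the rare informative class count more than errors on the majority class. The sketch below illustrates that idea under stated assumptions: the inverse-frequency weighting scheme, the function names, and the 1:9 toy label ratio are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood for binary labels.

    probs[i] is the predicted probability that sample i is informative (label 1).
    """
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p
        # Clamp to avoid log(0) on fully confident wrong predictions.
        total += weights[y] * -math.log(max(p_true, 1e-12))
        norm += weights[y]
    return total / norm


# Toy data (illustrative only): 1 informative tweet among 10.
labels = [1] + [0] * 9
weights = inverse_frequency_weights(labels)

# A model that always predicts "uninformative" with high confidence is
# penalised far more under the weighted loss than under a uniform one.
always_negative = [0.01] * 10
uniform = {0: 1.0, 1: 1.0}
weighted_loss = weighted_cross_entropy(always_negative, labels, weights)
uniform_loss = weighted_cross_entropy(always_negative, labels, uniform)
```

With the natural (unbalanced) distribution kept intact, a weighting of this kind is one way to stop a classifier from ignoring the rare class. Which scheme works best would need to be checked against the dataset itself.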

Citation (APA)

Olsen, B., & Plank, B. (2021). Finding the needle in a haystack: Extraction of Informative COVID-19 Danish Tweets. In W-NUT 2021 - 7th Workshop on Noisy User-Generated Text, Proceedings of the Conference (pp. 11–19). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.wnut-1.2
