CStory: A Chinese Large-scale News Storyline Dataset

Abstract

In today's massive news streams, storylines help readers discover related event pairs and understand how hot events evolve, so many efforts have been devoted to constructing news storylines automatically. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets. News storylines are expensive to annotate because the number of unlabeled relationships grows quadratically with the number of news events. To work around this difficulty, we propose a pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset that contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences supporting the annotation judgments. We conduct extensive experiments on CStory using various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly influences model performance and should be a focus of future work. Our dataset is publicly available at https://github.com/THU-KEG/CStory.
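To make the filtering idea concrete, the following is a minimal Python sketch of pruning candidate pairs by entity co-occurrence and a similarity threshold. It is not the authors' method: the abstract does not specify the entity extractor, the similarity model, or any thresholds. The `Article` type, the function names, both default thresholds, and the use of Jaccard token overlap as a stand-in for semantic similarity are all assumptions made for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class Article(NamedTuple):
    """Hypothetical record: an id, pre-extracted entities, and tokens."""
    id: str
    entities: frozenset
    tokens: frozenset


def total_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def jaccard(a: frozenset, b: frozenset) -> float:
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_candidate_pairs(articles, min_shared_entities=1, min_similarity=0.2):
    """Keep pairs that share enough entities AND are similar enough.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for left, right in combinations(articles, 2):
        # Entity co-occurrence check: skip pairs with too few shared entities.
        if len(left.entities & right.entities) < min_shared_entities:
            continue
        # Similarity check: skip pairs whose token overlap is too low.
        if jaccard(left.tokens, right.tokens) < min_similarity:
            continue
        kept.append((left.id, right.id))
    return kept
```

The quadratic growth mentioned in the abstract is what makes such a filter useful. If every unordered pair among 11,978 items were a candidate, `total_pairs(11978)` would give 71,730,253 pairs, far more than the 112,549 pairs that were manually labeled.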

Cite

Shi, K., Wang, X., Yu, J., Hou, L., Li, J., Wu, J., … Liu, Q. (2022). CStory: A Chinese Large-scale News Storyline Dataset. In International Conference on Information and Knowledge Management, Proceedings (pp. 4475–4479). Association for Computing Machinery. https://doi.org/10.1145/3511808.3557573
