CStory: A Chinese Large-scale News Storyline Dataset

Kaijie Shi; Xiaozhi Wang; Jifan Yu; Lei Hou; Juanzi Li; Jingtong Wu; Dingyu Yong; Jinghui Xiao; Qun Liu

Conference Proceedings (Open Access)

International Conference on Information and Knowledge Management, Proceedings (2022) 4475-4479

DOI: 10.1145/3511808.3557573

Abstract

In today's massive news streams, storylines help readers discover related event pairs and understand how hot events evolve, so many efforts have been devoted to constructing news storylines automatically. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets: news storylines are expensive to annotate because the number of unlabeled candidate relations grows quadratically with the number of news events. To work around this difficulty, we propose a pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset containing 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences supporting the annotation judgments. We conduct extensive experiments on CStory with various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly affects model performance and should be a focus of future work. The dataset is publicly available at https://github.com/THU-KEG/CStory.
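The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the entity extractor, the similarity model, or any thresholds, so everything below is an assumption made for illustration. Entities are taken as given sets, semantic similarity is approximated by bag-of-words cosine similarity, and the function names, dictionary keys, thresholds, and toy articles are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(articles, min_shared_entities=1, min_similarity=0.3):
    """Yield (id_a, id_b) for article pairs that pass both filters.

    `articles` maps an article id to a dict with two hypothetical keys:
    "entities" (a set of entity strings) and "tokens" (a list of tokens).
    Both thresholds are illustrative defaults, not values from the paper.
    """
    vectors = {aid: Counter(doc["tokens"]) for aid, doc in articles.items()}
    for a, b in combinations(sorted(articles), 2):
        # Filter 1: entity co-occurrence.
        shared = articles[a]["entities"] & articles[b]["entities"]
        if len(shared) < min_shared_entities:
            continue
        # Filter 2: similarity (bag-of-words stand-in for a semantic model).
        if cosine(vectors[a], vectors[b]) >= min_similarity:
            yield a, b


def all_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Toy data, invented for this example only.
articles = {
    "n1": {"entities": {"typhoon", "coast"},
           "tokens": ["typhoon", "lands", "coast", "evacuation"]},
    "n2": {"entities": {"typhoon"},
           "tokens": ["typhoon", "coast", "damage", "evacuation"]},
    "n3": {"entities": {"market"},
           "tokens": ["stock", "market", "falls"]},
    "n4": {"entities": {"typhoon", "market"},
           "tokens": ["typhoon", "insurance", "stock", "market"]},
}

kept = list(candidate_pairs(articles))
print(kept)              # [('n1', 'n2'), ('n3', 'n4')]
print(all_pairs(len(articles)))  # 6 possible pairs before filtering
```

In the toy data, two of the six possible pairs survive both filters. For scale, if every pair among the 11,978 articles were a candidate, there would be 11,978 × 11,977 / 2 = 71,730,253 pairs, which is the quadratic growth the abstract refers to; whether the authors enumerate pairs over the whole collection or within smaller groups is not stated in the abstract.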

Citation (APA)

Shi, K., Wang, X., Yu, J., Hou, L., Li, J., Wu, J., … Liu, Q. (2022). CStory: A Chinese Large-scale News Storyline Dataset. In International Conference on Information and Knowledge Management, Proceedings (pp. 4475–4479). Association for Computing Machinery. https://doi.org/10.1145/3511808.3557573
