A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization

Abstract

Cross-lingual summarization is a challenging task for which there are no cross-lingual scientific resources currently available. To overcome the lack of a high-quality resource, we present a new dataset for monolingual and cross-lingual summarization considering the English-German pair. We collect high-quality, real-world cross-lingual data from Spektrum der Wissenschaft, which publishes human-written German scientific summaries of English science articles on various subjects. The generated Spektrum dataset is small; therefore, we harvest a similar dataset from the Wikipedia Science Portal to complement it. The Wikipedia dataset consists of English and German articles, which can be used for monolingual and cross-lingual summarization. Furthermore, we present a quantitative analysis of the datasets and results of empirical experiments with several existing extractive and abstractive summarization models. The results suggest the viability and usefulness of the proposed dataset for monolingual and cross-lingual summarization.

Cite

Fatima, M., & Strube, M. (2021). A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization. In 3rd Workshop on New Frontiers in Summarization, NewSum 2021 - Workshop Proceedings (pp. 39–50). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.newsum-1.5
