LR-Sum: Summarization for Less-Resourced Languages


Abstract

We introduce LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe abstractive and extractive summarization experiments to establish baselines and discuss the limitations of this dataset.

Cite

Palen-Michel, C., & Lignos, C. (2023). LR-Sum: Summarization for Less-Resourced Languages. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 6829–6844). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.427
