We compiled a new sentence splitting corpus composed of 203K pairs of aligned complex source and simplified target sentences. In contrast to previously proposed text simplification corpora, which contain only a small number of split examples, we present a dataset in which each input sentence is broken down into a set of minimal propositions, i.e. a sequence of sound, self-contained utterances, each presenting a minimal semantic unit that cannot be further decomposed into meaningful propositions. The corpus is useful for developing sentence splitting approaches that learn to transform sentences with a complex linguistic structure into a fine-grained representation of short sentences with a simpler and more regular structure. Such sentences are easier for downstream applications to process, which in turn facilitates those applications and improves their performance.
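To make the notion of an aligned pair concrete, the following is a minimal sketch of how one complex sentence and its minimal propositions might be represented in memory. The class name `SplitPair`, its field names, the helper functions, and the example sentence are all hypothetical illustrations; they are not taken from the corpus or its distribution format.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SplitPair:
    """One aligned example: a complex source sentence and its split target.

    Hypothetical container; the corpus itself may be stored differently.
    """
    complex_sentence: str
    propositions: Tuple[str, ...]


def is_split(pair: SplitPair) -> bool:
    """Return True if the source was decomposed into more than one sentence."""
    return len(pair.propositions) > 1


def mean_propositions(pairs: Sequence[SplitPair]) -> float:
    """Average number of target sentences per source sentence."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(len(p.propositions) for p in pairs) / len(pairs)


# Invented example (not drawn from the corpus): a relative clause is
# separated from its main clause, giving two self-contained sentences.
example = SplitPair(
    complex_sentence="The novel, which was published in 1990, became a bestseller.",
    propositions=(
        "The novel was published in 1990.",
        "The novel became a bestseller.",
    ),
)
```

A sequence-to-sequence splitting model would typically take `complex_sentence` as input and be trained to emit the sentences in `propositions`, in order, as its output.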
Niklaus, C., Freitas, A., & Handschuh, S. (2019). MINWIKISPLIT: A sentence splitting corpus with minimal propositions. In INLG 2019 - 12th International Conference on Natural Language Generation, Proceedings of the Conference (pp. 118–123). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w19-8615