Improving Large-scale Paraphrase Acquisition and Generation


Abstract

This paper addresses quality issues in existing Twitter-based paraphrase datasets and discusses the need for two separate definitions of paraphrase for the identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus, which consists of a total of 130k sentence pairs with crowdsourced (MultiPIT_Crowd) and expert (MultiPIT_Expert) annotations using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved data annotation quality and a task-specific paraphrase definition, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results demonstrate that paraphrase generation models trained on MultiPIT_Auto generate more diverse and higher-quality paraphrases than counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.

Citation (APA)
Dou, Y., Jiang, C., & Xu, W. (2022). Improving Large-scale Paraphrase Acquisition and Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 9301–9323). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.631
