ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation


Abstract

Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are cost-inefficient and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., from machine back-translation) usually suffer from a lack of syntactic diversity: the generated paraphrase sentences are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation (AMR) back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse than those of existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.

Citation (APA)

Huang, K. H., Iyer, V., Hsu, I. H., Kumar, A., Chang, K. W., & Galstyan, A. (2023). ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 8047–8061). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.447
