Abstract
Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are costly and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., obtained by machine back-translation) usually lack syntactic diversity: the generated paraphrases are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale, syntactically diverse paraphrase dataset created by abstract meaning representation (AMR) back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse than those of existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.
Huang, K. H., Iyer, V., Hsu, I. H., Kumar, A., Chang, K. W., & Galstyan, A. (2023). ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 8047–8061). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.447