Abstract
Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are costly and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., obtained by machine back-translation) usually lack syntactic diversity: the generated paraphrases are very similar to the source sentences in terms of syntax. In this work, we present ParaAMR, a large-scale, syntactically diverse paraphrase dataset created by abstract meaning representation (AMR) back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases of ParaAMR are syntactically more diverse than those of existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.
Huang, K. H., Iyer, V., Hsu, I. H., Kumar, A., Chang, K. W., & Galstyan, A. (2023). ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 8047–8061). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.447