Avoiding Overlap in Data Augmentation for AMR-to-Text Generation

Abstract

Leveraging additional unlabeled data to boost model performance is common practice in machine learning and natural language processing. For generation tasks, if the additional data overlaps with the target-side evaluation data, then training on the additional data amounts to training on answers from the test set. This leads to scores that are inflated relative to real-world testing scenarios and causes problems when comparing models. We study the AMR dataset and Gigaword, which is widely used for improving AMR-to-text generators, and find significant overlap between Gigaword and a subset of the AMR dataset. We propose methods for excluding parts of Gigaword to remove this overlap, and show that our approach leads to a more realistic evaluation of the task of AMR-to-text generation. Going forward, we give simple best-practice recommendations for leveraging additional data in AMR-to-text generation.
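To make the overlap problem concrete, below is a minimal sketch of one way such filtering could work: dropping any augmentation sentence that shares an n-gram with a test-set target. This is an illustration only, not the authors' actual exclusion method; the `filter_augmentation` helper, the normalization scheme, and the n-gram length are all assumptions.

```python
# Sketch: exclude augmentation sentences that overlap with evaluation targets.
# Hypothetical illustration of the general idea, NOT the paper's exact method.

import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace for robust matching."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_augmentation(aug_sents, test_sents, n=8):
    """Drop augmentation sentences sharing any n-gram with a test target."""
    test_grams = set()
    for sent in test_sents:
        test_grams |= ngrams(normalize(sent).split(), n)

    kept = []
    for sent in aug_sents:
        grams = ngrams(normalize(sent).split(), n)
        if grams.isdisjoint(test_grams):  # no shared n-gram -> safe to keep
            kept.append(sent)
    return kept


if __name__ == "__main__":
    test = ["The senator voted against the bill on Tuesday."]
    aug = [
        "The senator voted against the bill on Tuesday.",  # exact duplicate
        "Markets rallied after the announcement.",          # no overlap
    ]
    print(filter_augmentation(aug, test, n=5))  # keeps only the second sentence
```

A stricter filter (smaller n) removes more near-duplicates at the cost of discarding more augmentation data; the right trade-off depends on how paraphrase-heavy the corpora are.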

Citation (APA)

Du, W., & Flanigan, J. (2021). Avoiding Overlap in Data Augmentation for AMR-to-Text Generation. In ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference (Vol. 2, pp. 1043–1048). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.acl-short.132
