Importance of Synthesizing High-quality Data for Text-to-SQL Parsing


Abstract

There has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed three shortcomings: illogical synthetic SQL queries from independent column sampling, arbitrary table joins, and language gaps between the synthesized SQL and natural language question (NLQ) pair. To address these issues, we propose a novel synthesis framework that imposes strong typing constraints, incorporates key relationships from schema, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated NLQ. When existing powerful text-to-SQL parsers are pretrained on our high-quality synthesized data, these models have significant accuracy boosts and achieve new state-of-the-art performance on Spider. We also demonstrate the effectiveness of our techniques with ablation studies.
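To make the idea of schema-distance-weighted column sampling concrete, here is a minimal sketch. The abstract does not give the weighting formula, so everything below is an assumption made for illustration: distance is taken as the number of foreign-key hops between tables, weights decay exponentially with that distance, and the names `table_distances`, `column_weights`, `sample_column`, and `decay` are hypothetical rather than taken from the paper.

```python
import random
from collections import deque


def table_distances(fk_edges, start):
    """Breadth-first hop counts from `start` over an undirected foreign-key graph."""
    adjacency = {}
    for a, b in fk_edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def column_weights(columns, fk_edges, anchor_table, decay=0.5):
    """Weight each (table, column) by decay ** hops from the anchor table.

    Assumption: exponential decay. Tables with no foreign-key path to the
    anchor are left out, so they can never be sampled together with it.
    """
    dist = table_distances(fk_edges, anchor_table)
    return {
        (table, column): decay ** dist[table]
        for table, column in columns
        if table in dist
    }


def sample_column(columns, fk_edges, anchor_table, rng, decay=0.5):
    """Draw one column, favoring tables close to the anchor table."""
    weights = column_weights(columns, fk_edges, anchor_table, decay)
    candidates = list(weights)
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]
```

With a toy schema in which `singer` links to `concert` and `concert` links to `stadium`, columns of `singer` receive weight 1.0, those of `concert` 0.5, and those of `stadium` 0.25, while a table with no key path to `singer` is never drawn. This is one way to avoid the illogical column combinations that independent sampling tends to produce.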

Cite

Hu, Y., Zhao, Y., Jiang, J., Lan, W., Zhu, H., Chauhan, A., … Xiang, B. (2023). Importance of Synthesizing High-quality Data for Text-to-SQL Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1327–1343). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.86
