Importance of Synthesizing High-quality Data for Text-to-SQL Parsing

Yiqun Hu; Yiyun Zhao; Jiarong Jiang; Wuwei Lan; Henry Zhu; Anuj Chauhan; Alexander Li; Lin Pan; Jun Wang; Chung Wei Hang; Sheng Zhang; Jiang Guo; Marvin Dong; Joe Lilien; Patrick Ng; Zhiguo Wang; Vittorio Castelli; Bing Xiang

Conference Proceedings

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023) 1327-1343

DOI: 10.18653/v1/2023.findings-acl.86

12 Citations

26 Readers

Abstract

There has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed three shortcomings: illogical synthetic SQL queries from independent column sampling, arbitrary table joins, and language gaps between the synthesized SQL and natural language question (NLQ) pair. To address these issues, we propose a novel synthesis framework that imposes strong typing constraints, incorporates key relationships from schema, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated NLQ. When existing powerful text-to-SQL parsers are pretrained on our high-quality synthesized data, these models have significant accuracy boosts and achieve new state-of-the-art performance on Spider. We also demonstrate the effectiveness of our techniques with ablation studies.
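As an illustration of the schema-distance-weighted column sampling mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not say how schema distance is defined or how it is turned into a weight. The sketch assumes distance is the number of foreign-key hops between tables and that weights decay geometrically with that distance. The toy schema, the function names, and the `decay` parameter are all hypothetical.

```python
import random
from collections import deque

# Hypothetical toy schema: table name -> column names.
SCHEMA = {
    "singer": ["singer_id", "name", "age"],
    "concert": ["concert_id", "stadium_id", "year"],
    "singer_in_concert": ["singer_id", "concert_id"],
    "stadium": ["stadium_id", "capacity"],
}
# Hypothetical foreign-key links, treated as undirected edges between tables.
FOREIGN_KEYS = [
    ("singer_in_concert", "singer"),
    ("singer_in_concert", "concert"),
    ("concert", "stadium"),
]


def table_distances(start):
    """Breadth-first search over the foreign-key graph; returns hop counts."""
    neighbors = {table: set() for table in SCHEMA}
    for a, b in FOREIGN_KEYS:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for nxt in neighbors[table]:
            if nxt not in dist:
                dist[nxt] = dist[table] + 1
                queue.append(nxt)
    return dist


def column_weights(anchor_table, decay=0.5):
    """Assumed weighting: decay ** hops. Unreachable tables are left out."""
    dist = table_distances(anchor_table)
    return {
        (table, column): decay ** hops
        for table, hops in dist.items()
        for column in SCHEMA[table]
    }


def sample_column(anchor_table, rng, decay=0.5):
    """Draw one (table, column) pair, favoring tables close to the anchor."""
    weights = column_weights(anchor_table, decay)
    columns = list(weights)
    return rng.choices(columns, weights=[weights[c] for c in columns], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_column("singer", rng))
```

Under these assumptions, columns of the anchor table are the most likely to be drawn and columns of distant tables are rare but still possible, in contrast to sampling every column independently and uniformly.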

Cite

Hu, Y., Zhao, Y., Jiang, J., Lan, W., Zhu, H., Chauhan, A., … Xiang, B. (2023). Importance of Synthesizing High-quality Data for Text-to-SQL Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 1327–1343). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.86