CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset

Abstract

The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, while the single-domain text-to-SQL task evaluates performance on the same databases used for training. Both setups face unavoidable difficulties in real-world applications. To this end, we introduce the cross-schema text-to-SQL task, where the databases in the evaluation data differ from those in the training data but come from the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to support the corresponding studies. CSS originally consisted of 4,340 question/SQL pairs across 2 databases. To generalize models to different medical systems, we extend CSS and create 19 new databases along with 29,280 corresponding dataset examples. CSS is also a large corpus for single-domain Chinese text-to-SQL studies. We present the data collection approach and a series of analyses of the data statistics. To show the potential and usefulness of CSS, we run and report benchmark baselines. Our dataset is publicly available at https://huggingface.co/datasets/zhanghanchong/css.
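
The cross-schema setup described above can be illustrated with a minimal Python sketch: examples are split by database, so that every evaluation database is absent from training while all databases share one domain. The field names `db_id`, `question`, and `sql`, the toy database names, and the helper `cross_schema_split` are hypothetical and chosen only for illustration; they are not taken from the CSS release and may not match its actual format.

```python
def cross_schema_split(examples, eval_db_ids):
    """Split examples so that evaluation databases never appear in training.

    `examples` is a list of dicts with a hypothetical "db_id" key;
    `eval_db_ids` is the set of databases held out for evaluation.
    """
    eval_db_ids = set(eval_db_ids)
    train = [ex for ex in examples if ex["db_id"] not in eval_db_ids]
    evaluation = [ex for ex in examples if ex["db_id"] in eval_db_ids]
    return train, evaluation


# Toy data (assumed, not from CSS): three medical databases in one domain.
examples = [
    {"db_id": "medical_a", "question": "q1", "sql": "SELECT 1"},
    {"db_id": "medical_a", "question": "q2", "sql": "SELECT 2"},
    {"db_id": "medical_b", "question": "q3", "sql": "SELECT 3"},
    {"db_id": "medical_c", "question": "q4", "sql": "SELECT 4"},
]

train, evaluation = cross_schema_split(examples, {"medical_c"})

# The two splits share a domain but have no database in common.
train_dbs = {ex["db_id"] for ex in train}
eval_dbs = {ex["db_id"] for ex in evaluation}
```

By contrast, a single-domain split would draw training and evaluation examples from the same `db_id`, and a cross-domain split would hold out databases from entirely different domains.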

Cite

Zhang, H., Li, J., Chen, L., Cao, R., Zhang, Y., Huang, Y., … Yu, K. (2023). CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 6970–6983). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.435
