Mar. 1st, 2024: Following the policy of Spider, we have decided to make the CSpider test set publicly available. You are encouraged to freely test it by checking the CSpider dataset link below. Please note that we will no longer accept submissions for CSpider.
CSpider is a large-scale, complex, cross-domain Chinese semantic parsing and text-to-SQL dataset, translated from Spider by two NLP researchers and one computer science student. The goal of the CSpider challenge is to develop natural language interfaces to cross-domain databases for Chinese, which is currently a low-resource language in this task area. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 multi-table databases covering 138 different domains. Following Spider 1.0, the train and test sets of CSpider contain different complex SQL queries and different databases. To do well, systems must generalize not only to new SQL queries but also to new database schemas.
CSpider is translated from Spider, but it poses additional challenges. First, the structure of a relational database, in particular its table and column names, is typically expressed in English, which complicates mapping Chinese questions to the database schema. Second, the basic semantic units denoting columns or cells are words, but Chinese word segmentation can be erroneous. It is also interesting to study the influence of other linguistic characteristics of Chinese, such as zero pronouns, on SQL parsing.
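The first challenge above can be sketched in a few lines. The snippet below is a hypothetical illustration (not part of any CSpider baseline): a naive lexical schema linker that matches question tokens against English column names finds a link for the English question but nothing for its Chinese counterpart, since Chinese words share no surface form with the English schema.

```python
# Hypothetical illustration: naive lexical schema linking breaks down when
# the question is Chinese but the schema vocabulary is English.
question_en = "How many singers are there?"
question_zh = "有多少名歌手？"  # same question in Chinese
schema_columns = {"singer", "name", "age"}


def lexical_links(question, columns):
    """Return schema items that appear inside some question token."""
    tokens = set(question.lower().rstrip("?？").split())
    return {c for c in columns if any(c in t for t in tokens)}


print(lexical_links(question_en, schema_columns))  # {'singer'} via "singers"
print(lexical_links(question_zh, schema_columns))  # set() -- no lexical overlap
```

Real systems therefore need cross-lingual representations (e.g. multilingual pretrained encoders, as in several leaderboard entries below) rather than surface matching.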
CSpider Paper (EMNLP'19)

The data is split into training, development, and unreleased test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):
Details of the baseline models and the evaluation script can be found at the following GitHub site:
Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. Here is a tutorial walking you through the official evaluation of your model:
Submission Tutorial

Some examples look like the following:
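The examples rendered on the original page are not reproduced here. A hypothetical question–SQL pair in the CSpider format (illustrative only, not an actual dataset entry) would pair a Chinese question with an English-schema SQL query:

```
question: 有多少名歌手？        (How many singers are there?)
query:    SELECT count(*) FROM singer
```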
Ask us questions at our GitHub issues page, or contact minqingkai@westlake.edu.cn or shiyuefeng@westlake.edu.cn.
We expect the dataset to evolve. We would greatly appreciate it if you could donate your non-private databases or SQL queries to the project.
We thank Tao Yu for sharing the original Spider test set with us, and the anonymous reviewers for their valuable comments on this project. We also thank Pranav Rajpurkar and Tao Yu for giving us permission to build this website based on SQuAD and Spider.
Following Spider, we use exact matching for evaluation. Instead of simply comparing the predicted and gold SQL queries as strings, we decompose each SQL query into its clauses and perform a set comparison within each clause. Please refer to our GitHub page, or to the Spider paper and its GitHub page, for more details.
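The clause-level comparison can be sketched as follows. This is a simplified illustration, not the official evaluation script: the clause list and tokenization are assumptions, and real evaluation must also handle nested subqueries, aliases, and value normalization.

```python
import re

# Simplified clause set; the official script handles more constructs.
CLAUSE_KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY",
                   "HAVING", "ORDER BY", "LIMIT"]


def split_clauses(sql):
    """Split a flat SQL string into {clause keyword: set of tokens}."""
    sql = sql.strip().rstrip(";")
    pattern = "(" + "|".join(CLAUSE_KEYWORDS) + ")"
    parts = re.split(pattern, sql, flags=re.IGNORECASE)
    # parts = ['', KW1, body1, KW2, body2, ...]
    return {kw.upper(): set(body.upper().split())
            for kw, body in zip(parts[1::2], parts[2::2])}


def exact_match(pred, gold):
    """Match if both queries have the same clauses with equal token sets."""
    return split_clauses(pred) == split_clauses(gold)
```

Because each clause is compared as a set, a prediction like `SELECT age, name` matches a gold `SELECT name, age` even though the strings differ, while a missing or extra clause (e.g. an absent `WHERE`) makes the match fail.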
Rank | Date | Model | Team | Dev | Test
---|---|---|---|---|---
1 | January 16, 2024 | FastRAT + AST Ranking + GPT-4 | HUAWEI Poisson-ERC-KG Lab & HUAWEI Cloud | 66.2 | 62.1
2 | June 29, 2023 | Roberta + Seq2SQL | Beijing PERCENT Technology Group Co., Ltd. | 66.2 | 60.6
3 | May 24, 2022 | LGESQL + GTL + Electra + QT | SJTU X-LANCE Lab | 64.0 | 60.3
4 | December 31, 2021 | LGESQL + ELECTRA + QT | HUAWEI Poisson Lab & HUAWEI Cloud | 64.5 | 58.1
5 | May 24, 2022 | LGESQL + GTL + Infoxlm | SJTU X-LANCE Lab | 61.0 | 57.0
6 | November 27, 2020 | RAT-SQL + GraPPa + Adv | Alibaba | 59.7 | 56.2
7 | May 24, 2022 | LGESQL + GTL + Multilingual BERT | SJTU X-LANCE Lab | 58.6 | 52.7
8 | July 8, 2020 | XL-SQL | Anonymous | 54.9 | 47.8
9 | November 25, 2020 | DG-SQL + Multilingual BERT | University of Edinburgh (https://arxiv.org/abs/2010.11988) | 50.4 | 46.9
10 | October 10, 2020 | RAT-SQL (without schema linking) + Multilingual BERT | Anonymous | 41.4 | 37.3
11 | July 15, 2020 | RYANSQL + Multilingual BERT | Kakao Enterprise (https://arxiv.org/abs/2004.03125) | 41.3 | 34.7
12 | July 8, 2020 | DG-SQL | Anonymous | 35.5 | 26.8
13 | November 28, 2019 | CN-SQL | oneconnect | 22.9 | 18.8
14 | September 18, 2019 | SyntaxSQLNet (based on Yu et al. (2018a)) | Westlake University (https://arxiv.org/abs/1909.13293) | 16.4 | 13.3