DIR: A Large-Scale Dialogue Rewrite Dataset for Cross-Domain Conversational Text-to-SQL

Published:2023-02-09 Issue:4 Volume:13 Page:2262
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Li Jieyu¹, Chen Zhi¹, Chen Lu¹, Zhu Zichen¹, Li Hanqi¹, Cao Ruisheng¹, Yu Kai¹

Affiliation:

1. X-LANCE Lab, MoE Key Lab of Artificial Intelligence, AI Institute, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

Semantic co-reference and ellipsis often cause information deficiency when natural language utterances in a multi-turn dialogue are parsed into SQL (i.e., the conversational text-to-SQL task). Dividing the dialogue understanding task into dialogue utterance rewriting and language understanding is a feasible way to tackle this problem. To this end, we present a two-stage framework for conversational text-to-SQL tasks. To build an effective rewriting model for the first stage, we provide a large-scale dialogue rewrite dataset (DIR), extended from two cross-domain conversational text-to-SQL datasets, SParC and CoSQL. The dataset contains 5908 dialogues involving 160 domains. It therefore serves conversational text-to-SQL tasks and is also a valuable corpus for dialogue rewrite study. In experiments, we validate the effectiveness of our annotations with a popular text-to-SQL parser, RAT-SQL. The results show QEM accuracy improvements of 11.81 and 27.17 on SParC and CoSQL, respectively, when we eliminate the problem of semantically incomplete utterances by directly parsing the golden rewrite utterances. Evaluating the two-stage framework with different rewrite models shows that the quality of the rewrite model is important and still needs improvement. Additionally, as a new benchmark for the dialogue rewrite task, we report the performance of different baselines for related studies. Our dataset will be publicly available once this paper is accepted.
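
The two-stage idea in the abstract (rewrite the current turn into a self-contained question, then parse that question into SQL) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `two_stage_parse`, `toy_rewrite`, and `toy_parse`, the lookup tables, and the example schema are hypothetical and are not taken from the paper, whose rewrite models and RAT-SQL parser are trained systems rather than lookups.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (not from the paper):
# a rewriter maps (dialogue history, current utterance) -> standalone question,
# a parser maps a standalone question -> SQL string.
Rewriter = Callable[[List[str], str], str]
Parser = Callable[[str], str]


def two_stage_parse(history: List[str], utterance: str,
                    rewrite: Rewriter, parse: Parser) -> str:
    """Stage 1 makes the turn context-independent; stage 2 parses it."""
    standalone = rewrite(history, utterance)
    return parse(standalone)


# Toy stand-ins for trained models, backed by hand-written tables.
TOY_REWRITES: Dict[str, str] = {
    "How many of them are older than 30?":
        "How many singers are older than 30?",
}
TOY_SQL: Dict[str, str] = {
    "Show all singers.": "SELECT * FROM singer",
    "How many singers are older than 30?":
        "SELECT COUNT(*) FROM singer WHERE age > 30",
}


def toy_rewrite(history: List[str], utterance: str) -> str:
    # A real rewriter would use the history to resolve co-reference and
    # ellipsis; this stub ignores it and returns unknown turns unchanged.
    return TOY_REWRITES.get(utterance, utterance)


def toy_parse(question: str) -> str:
    # A real parser would also take the database schema as input.
    return TOY_SQL.get(question, "")


if __name__ == "__main__":
    history = ["Show all singers."]
    turn = "How many of them are older than 30?"
    print(two_stage_parse(history, turn, toy_rewrite, toy_parse))
```

Because the parser only ever sees a standalone question, a single-turn text-to-SQL parser can be reused unchanged, and the quality of the final SQL depends directly on the quality of the rewrite.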

Funder

China NSFC Projects

Shanghai Municipal Science and Technology Major Project

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/4/2262/pdf

References: 40 articles.

1. Natural language interfaces to databases-an introduction;Androutsopoulos;Nat. Lang. Eng.,1995

2. Quan, J., Xiong, D., Webber, B., and Hu, C. (2019, November 3–7). GECOR: An End-to-End Generative Ellipsis and Co-reference Resolution Model for Task-Oriented Dialogue. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.

3. Chen, Z., Chen, L., Li, H., Cao, R., Ma, D., Wu, M., and Yu, K. (2021, August 1–6). Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL. Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Virtual.

4. Yu, T., Zhang, R., Yasunaga, M., Tan, Y.C., Lin, X.V., Li, S., Er, H., Li, I., Pang, B., Chen, T., and Ji, E. (2019, July 28–August 2). SParC: Cross-Domain Semantic Parsing in Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics, Florence, Italy.

5. Yu, T., Zhang, R., Er, H., Li, S., Xue, E., Pang, B., Lin, X.V., Tan, Y.C., Shi, T., and Li, Z. (2019, November 3–7). CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.

Cited by 3 articles.

1. Dialogue-Rewriting Model Based on Transformer Pointer Extraction;Electronics;2024-06-17

2. A Survey of Natural Language-Based Editing of Low-Code Applications Using Large Language Models;Lecture Notes in Computer Science;2024

3. Interactivity;Natural Language Interfaces to Databases;2023-11-25