SchemaPile: A Large Collection of Relational Database Schemas

Published:2024-05-29 Issue:3 Volume:2 Page:1-25
ISSN:2836-6573
Container-title:Proceedings of the ACM on Management of Data
language:en
Short-container-title:Proc. ACM Manag. Data

Author:

Döhmen Till¹, Geacu Radu¹, Hulsebos Madelon², Schelter Sebastian¹

Affiliation:

1. University of Amsterdam, Amsterdam, NL

2. UC Berkeley, Berkeley, USA

Abstract

Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in the wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas. In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, seven million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of SchemaPile to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3654975

References: 58 articles.

1. Detecting data errors

2. Andi Albrecht. 2023. python-sqlparse -- a non-validating SQL parser for Python. https://github.com/andialbrecht/sqlparse

3. WebTables

4. Aurum: A Data Discovery System

5. Cody James Christopher, Kristen Moore, and David Liebowitz. 2021. SchemaDB: Structures in relational datasets. arXiv preprint arXiv:2111.12835 (2021).

Cited by 1 article.

1. Directions Towards Efficient and Automated Data Wrangling with Large Language Models. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 2024-05-13.