GitTables: A Large-Scale Corpus of Relational Tables

Published: 2023-05-26 Issue: 1 Volume: 1 Page: 1-17
ISSN: 2836-6573
Container-title: Proceedings of the ACM on Management of Data
Language: en
Short-container-title: Proc. ACM Manag. Data

Author:

Hulsebos Madelon¹, Demiralp Çagatay², Groth Paul¹

Affiliation:

1. University of Amsterdam, Amsterdam, Netherlands

2. Sigma Computing, San Francisco, CA, USA

Abstract

The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3588710
