ArcheType: A Novel Framework for Open-Source Column Type Annotation Using Large Language Models

Published: 2024-05 Issue: 9 Volume: 17 Pages: 2279-2292
ISSN: 2150-8097
Container title: Proceedings of the VLDB Endowment
Language: en
Short container title: Proc. VLDB Endow.

Author:

Feuer Benjamin¹, Liu Yurong¹, Hegde Chinmay¹, Freire Juliana¹

Affiliation:

1. New York University

Abstract

Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
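
The four stages named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the prompt wording, the sample size `k`, and the remapping strategy based on `difflib` are all assumptions made for illustration. The `query_model` argument is a hypothetical stand-in for whatever LLM call is actually used.

```python
import difflib
import random
from typing import Callable, Sequence


def sample_context(values: Sequence[object], k: int = 5, seed: int = 0) -> list[str]:
    """Context sampling (assumed strategy): up to k distinct, non-empty cell values."""
    unique = sorted({str(v).strip() for v in values if str(v).strip()})
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def serialize_prompt(samples: Sequence[str], labels: Sequence[str]) -> str:
    """Prompt serialization (assumed wording): values plus the candidate type set."""
    return (
        "Column values: " + ", ".join(samples) + "\n"
        "Candidate types: " + ", ".join(labels) + "\n"
        "Answer with exactly one candidate type."
    )


def remap_label(raw: str, labels: Sequence[str]) -> str:
    """Label remapping (assumed strategy): force free-form output onto a known label."""
    if not labels:
        raise ValueError("labels must not be empty")
    cleaned = raw.strip().lower()
    lowered = [label.lower() for label in labels]
    # 1) exact match, 2) label mentioned inside the answer, 3) closest string.
    if cleaned in lowered:
        return labels[lowered.index(cleaned)]
    for label, low in zip(labels, lowered):
        if low in cleaned:
            return label
    best = difflib.get_close_matches(cleaned, lowered, n=1, cutoff=0.0)[0]
    return labels[lowered.index(best)]


def annotate_column(
    values: Sequence[object],
    labels: Sequence[str],
    query_model: Callable[[str], str],  # hypothetical LLM wrapper: prompt -> text
) -> str:
    """Zero-shot annotation of one column: sample, serialize, query, remap."""
    prompt = serialize_prompt(sample_context(values), labels)
    return remap_label(query_model(prompt), labels)


# Usage with a fake model that answers in free text instead of a bare label.
fake_model = lambda prompt: "I think this is a Country column."
print(annotate_column(["France", "Peru", "Japan"], ["city", "country", "person"], fake_model))
```

The remapping step is what keeps the pipeline zero-shot in this sketch: the model may answer in free text, and the output is still forced onto the caller's label set without any training.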

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3665844.3665857
