Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

Published:2023-10 Issue:2 Volume:17 Page:92-105
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Arora Simran¹,Yang Brandon¹,Eyuboglu Sabri¹,Narayan Avanika¹,Hojel Andrew¹,Trummer Immanuel²,Ré Christopher¹

Affiliation:

1. Stanford University

2. Cornell University

Abstract

A long-standing goal in the data management community is developing systems that input documents and output queryable tables without user effort. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using the in-context learning abilities of large language models (LLMs). We propose and evaluate Evaporate, a prototype system powered by LLMs. We identify two strategies for implementing this system: prompt the LLM to directly extract values from documents, or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended implementation, Evaporate-Code+, which achieves better quality than direct extraction. Our insight is to generate many candidate functions and ensemble their extractions using weak supervision. Evaporate-Code+ outperforms the state-of-the-art systems using a sublinear pass over the documents with the LLM. This equates to a 110× reduction in the number of documents the LLM needs to process across our 16 real-world evaluation settings.
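
The ensembling idea in the abstract (many candidate extraction functions, combined per document) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the candidate functions are hand-written stand-ins for LLM-synthesized code, the attribute ("color") and the sample documents are invented, and an unweighted majority vote is swapped in for the weak-supervision aggregation the abstract names. It is not the Evaporate implementation.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical type: a candidate function maps a document to a value or None.
Extractor = Callable[[str], Optional[str]]


def by_label_colon(doc: str) -> Optional[str]:
    """Matches 'Color: <value>' (case-sensitive)."""
    m = re.search(r"Color:\s*(\w+)", doc)
    return m.group(1).lower() if m else None


def by_label_equals(doc: str) -> Optional[str]:
    """Matches 'color = <value>' (case-insensitive)."""
    m = re.search(r"color\s*=\s*(\w+)", doc, flags=re.IGNORECASE)
    return m.group(1).lower() if m else None


def by_loose_label(doc: str) -> Optional[str]:
    """Matches 'color', any run of non-word characters, then a value."""
    m = re.search(r"color\W+(\w+)", doc, flags=re.IGNORECASE)
    return m.group(1).lower() if m else None


def by_last_word(doc: str) -> Optional[str]:
    """Deliberately noisy candidate: returns the last whitespace-separated token."""
    words = doc.split()
    return words[-1].lower() if words else None


# Stand-ins for the many candidate functions an LLM might synthesize.
CANDIDATES: List[Extractor] = [
    by_label_colon,
    by_label_equals,
    by_loose_label,
    by_last_word,
]


def ensemble_extract(doc: str, extractors: List[Extractor]) -> Optional[str]:
    """Combine candidate outputs by unweighted majority vote.

    Simplification: a weak-supervision approach would instead weight each
    function by an estimate of its quality. Abstentions (None) are ignored.
    """
    votes: Counter = Counter()
    for extract in extractors:
        value = extract(doc)
        if value is not None:
            votes[value] += 1
    return votes.most_common(1)[0][0] if votes else None


# Invented sample documents with heterogeneous formatting.
DOCS = ["Color: Red; Size: 10", "size=4, color = blue", "Color: green"]

if __name__ == "__main__":
    for d in DOCS:
        print(d, "->", ensemble_extract(d, CANDIDATES))
```

In this sketch the noisy candidate is outvoted on the first document, which is the intuition behind combining many imperfect functions rather than trusting any single one. The LLM is only needed to write the functions, not to read every document, which is where the cost saving described in the abstract comes from.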

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3626292.3626294

References: 68 articles.

1. April 2023. Wikipedia Statistics. https://en.wikipedia.org/wiki/Special:Statistics

2. Large language models are few-shot clinical information extractors

3. Simran Arora, Patrick Lewis, Angela Fan, Jacob Kahn, and Christopher Ré. 2023. Reasoning over Public and Private Data in Retrieval-Based Systems. Transactions of the Association for Computational Linguistics (TACL) (2023).

4. Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. 2023. Ask Me Anything: A simple strategy for prompting language models. International Conference on Learning Representations (ICLR) (2023).

5. Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. (2023). https://www.dropbox.com/scl/fi/3gt3ixdbvp986ptyz5j4t/VLDB_Revision.pdf?rlkey=mxi2kqp7rqx0frm9s7bpttwcq&dl=0

Cited by 11 articles.

1. BC4LLM: A perspective of trusted artificial intelligence when blockchain meets large language models;Neurocomputing;2024-09

2. Automated Mining of Structured Knowledge from Text in the Era of Large Language Models;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

3. Construction of Knowledge Graphs: Current State and Challenges;Information;2024-08-22

4. To prompt or not to prompt: Navigating the use of Large Language Models for integrating and modeling heterogeneous data;Data & Knowledge Engineering;2024-07

5. DevSec-GPT — Generative-AI (with Custom-Trained Meta's Llama2 LLM), Blockchain, NFT and PBOM Enabled Cloud Native Container Vulnerability Management and Pipeline Verification Platform;2024 IEEE Cloud Summit;2024-06-27