Chorus: Foundation Models for Unified Data Discovery and Exploration

Authors:

Kayali Moe (1), Lykov Anton (1), Fountalis Ilias (2), Vasiloglou Nikolaos (2), Olteanu Dan (3), Suciu Dan (1)

Affiliation:

1. University of Washington

2. RelationalAI

3. University of Zurich

Abstract

We apply foundation models to data discovery and exploration tasks. Foundation models are large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation, and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms task-specific models and thus the state of the art. Further, our approach often surpasses human-expert task performance. We investigate the fundamental characteristics of this approach, including its generalizability to several foundation models and the impact of non-determinism on the outputs. Overall, this suggests a future direction in which disparate data management tasks can be unified under foundation models.
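To make the three tasks concrete, the sketch below shows one way each could be phrased as a text prompt for a foundation model. It is an illustration only: the abstract does not give the prompts used in the paper, and the function names (`serialize_table`, `table_class_prompt`, `column_type_prompt`, `join_column_prompt`), the CSV-style serialization, and the wording are all assumptions. No model is called; the code only builds the prompt strings.

```python
from typing import Sequence


def serialize_table(header: Sequence[str], rows: Sequence[Sequence[object]],
                    max_rows: int = 5) -> str:
    """Render a header and the first max_rows rows as comma-separated text (hypothetical format)."""
    lines = [",".join(header)]
    for row in rows[:max_rows]:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines)


def table_class_prompt(header, rows, classes: Sequence[str]) -> str:
    """Table-class detection: ask which class best describes the whole table."""
    return (
        "Table:\n" + serialize_table(header, rows) + "\n"
        "Candidate classes: " + ", ".join(classes) + "\n"
        "Which class best describes this table? Answer with one class."
    )


def column_type_prompt(header, rows, column: str, types: Sequence[str]) -> str:
    """Column-type annotation: ask for the semantic type of one column."""
    return (
        "Table:\n" + serialize_table(header, rows) + "\n"
        "Candidate types: " + ", ".join(types) + "\n"
        f"What is the semantic type of column '{column}'? Answer with one type."
    )


def join_column_prompt(left_header, left_rows, right_header, right_rows) -> str:
    """Join-column prediction: ask which pair of columns the two tables join on."""
    return (
        "Left table:\n" + serialize_table(left_header, left_rows) + "\n"
        "Right table:\n" + serialize_table(right_header, right_rows) + "\n"
        "Which column of the left table joins with which column of the right table? "
        "Answer as left_column=right_column."
    )


if __name__ == "__main__":
    # Made-up toy data, used only to show the shape of a prompt.
    header = ["name", "city"]
    rows = [["Ada", "London"], ["Grace", "New York"]]
    print(column_type_prompt(header, rows, "city", ["Person", "City", "Country"]))
```

Because the abstract notes that model outputs are non-deterministic, a caller using prompts like these would likely need to parse the reply defensively and possibly sample more than once; that policy is left out here since the abstract does not specify one.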

Publisher

Association for Computing Machinery (ACM)

