Can Large Language Models Predict Data Correlations from Column Names?

Published: 2023-09 Issue: 13 Volume: 16 Page: 4310-4323
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Trummer Immanuel¹

Affiliation:

1. Cornell Database Group, Ithaca, NY, USA

Abstract

Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.
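To make the prediction target concrete, here is a minimal sketch of how ground-truth labels for column pairs might be derived from the data itself before any language model sees the column names. It is not the paper's pipeline: the Pearson coefficient is only one of the possible correlation metrics, and the 0.9 threshold, the function names, and the toy table are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant columns as uncorrelated
    return cov / sqrt(var_x * var_y)


def label_column_pairs(table, threshold=0.9):
    """Return (name_a, name_b, coefficient, is_correlated) for every column pair.

    `table` maps column names to lists of numbers; `threshold` is an
    illustrative cut-off on the absolute coefficient, not a value from the paper.
    """
    rows = []
    for a, b in combinations(sorted(table), 2):
        r = pearson(table[a], table[b])
        rows.append((a, b, r, abs(r) >= threshold))
    return rows


# Hypothetical toy table; column names and values are made up for illustration.
toy_table = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "zip_code": [14850.0, 10001.0, 94105.0, 60601.0],
}

labels = label_column_pairs(toy_table)
for name_a, name_b, r, correlated in labels:
    print(f"{name_a} / {name_b}: r={r:.3f} correlated={correlated}")
```

A name-based predictor would then receive only the two column names of each pair and be scored against the `is_correlated` label.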

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3625054.3625066

References: 62 articles.

1. Detecting unique column combinations on dynamic data

2. Natural language interfaces to databases

3. Bravais-Pearson and Spearman correlation coefficients: meaning, test of hypothesis and confidence interval

4. P. G. Brown and P. J. Haas. 2003. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In VLDB. 668-679. http://dl.acm.org/citation.cfm?id=1315509

Cited by 3 articles.

1. Chameleon: Foundation Models for Fairness-Aware Multi-Modal Data Augmentation to Enhance Coverage of Minorities;Proceedings of the VLDB Endowment;2024-07

2. Large Language Models: Principles and Practice;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

3. DB-BERT: making database tuning tools “read” the manual;The VLDB Journal;2023-12-27