Leveraging large language models for data analysis automation

Published: 2023-12-12

Author:

Jansen Jacqueline A, Manukyan Artür, Al Khoury Nour, Akalin Altuna

Abstract

Data analysis is constrained by a shortage of skilled experts, particularly in biology, where detailed data interpretation is vital for understanding complex biological processes and developing new treatments and diagnostics. To address this, we developed mergen, an R package that leverages Large Language Models (LLMs) for data analysis code generation and execution. Our primary goal is to enable humans to conduct data analysis by simply describing their objectives and the desired analyses for specific datasets in plain text. Our approach improves code generation via specialized prompt engineering and error feedback mechanisms. In addition, our system can execute the data analysis workflows prescribed by the LLM, providing the results of each workflow for human review. We evaluated the performance of this system on a range of data analysis tasks. Our evaluation revealed that while LLMs effectively generate code for some data analysis tasks, challenges remain in generating executable code, especially for complex tasks. Our study contributes to a better understanding of LLM capabilities and limitations, providing software infrastructure and practical insights for their effective integration into data analysis workflows.
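
The error feedback idea mentioned in the abstract can be illustrated with a minimal sketch. This is not mergen's API (mergen is an R package, and its functions are not shown on this page); the example is written in Python, and every name in it (`run_with_error_feedback`, `generate_code`, `fake_llm`) is a hypothetical stand-in. It assumes only what the abstract states: generated code is executed, and when execution fails, the error is returned to the model so it can try again.

```python
import traceback
from typing import Callable, Optional


def run_with_error_feedback(
    task: str,
    generate_code: Callable[[str], str],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Request code for a task, run it, and feed any error back into the prompt.

    Returns the namespace produced by the first successful run,
    or None if every attempt raised an exception.
    """
    prompt = task
    for _ in range(max_attempts):
        code = generate_code(prompt)
        namespace: dict = {}
        try:
            # Assumption: the code is trusted demo code. A real system
            # would need sandboxing before executing model output.
            exec(code, namespace)
            return namespace
        except Exception:
            error = traceback.format_exc(limit=1)
            # Rebuild the prompt so the generator sees its failed code and the error.
            prompt = (
                f"{task}\n\nThe previous code failed:\n{code}\n\n"
                f"Error:\n{error}\nPlease return corrected code."
            )
    return None


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: fails once, then corrects itself."""
    if "Error:" in prompt:
        return "values = [1, 2, 3]\nresult = sum(values) / len(values)"
    return "result = sum(values) / len(values)"  # fails: 'values' is undefined


outcome = run_with_error_feedback("Compute the mean of 1, 2, 3.", fake_llm)
print(outcome["result"] if outcome else "no runnable code produced")
```

Capping the number of attempts reflects the abstract's finding that executable code is not always obtained, so the loop has to be able to give up and report failure rather than retry indefinitely.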

Publisher

Cold Spring Harbor Laboratory

Cited by 4 articles.

1. Leveraging LangChain agents to automate data analysis for SaaS;Artificial Intelligence;2024-06-28

2. Bioinformatics and biomedical informatics with ChatGPT: Year one review;Quantitative Biology;2024-06-27

3. Large Language Model-assisted Clustering and Concept Identification of Engineering Design Data;2024 IEEE Conference on Artificial Intelligence (CAI);2024-06-25

4. Replicating a High-Impact Scientific Publication Using Systems of Large Language Models;2024-04-12