Can GPT-4 Replicate Empirical Software Engineering Research?

Published: 2024-07-12 Issue: FSE Volume: 1 Page: 1330-1353
ISSN: 2994-970X
Container-title: Proceedings of the ACM on Software Engineering
Language: en
Short-container-title: Proc. ACM Softw. Eng.

Author:

Liang Jenny T.¹, Badea Carmen², Bird Christian², DeLine Robert², Ford Denae², Forsgren Nicole², Zimmermann Thomas²

Affiliation:

1. Carnegie Mellon University, Pittsburgh, USA

2. Microsoft Research, Redmond, USA

Abstract

Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help replicate and thus democratize empirical software engineering research. In this paper, we examine GPT-4’s abilities to perform replications of empirical software engineering research on new data. We specifically study its ability to surface assumptions made in empirical software engineering research methodologies, as well as its ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that apply common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as practitioner data scientists in software teams.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3660767
