Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues

Published: 2024-06-04 Issue: 5 Volume: 33 Page: 1-26
ISSN: 1049-331X
Container-title: ACM Transactions on Software Engineering and Methodology
Language: en
Short-container-title: ACM Trans. Softw. Eng. Methodol.

Author:

Liu Yue¹, Le-Cong Thanh², Widyasari Ratnadira³, Tantithamthavorn Chakkrit⁴, Li Li⁵, Le Xuan-Bach D.², Lo David³

Affiliation:

1. Monash University, Clayton, Australia and Singapore Management University, Singapore, Singapore

2. The University of Melbourne, Melbourne, Australia

3. Singapore Management University, Singapore, Singapore

4. Monash University, Clayton, Australia

5. Beihang University, Beijing, China

Abstract

Since its introduction in November 2022, ChatGPT has rapidly gained popularity due to its remarkable ability in language understanding and human-like responses. ChatGPT, based on the GPT-3.5 architecture, has shown great promise for revolutionizing various research fields, including code generation. However, the reliability and quality of code generated by ChatGPT remain unexplored, raising concerns about potential risks associated with the widespread use of ChatGPT-driven code generation. In this article, we systematically study the quality of 4,066 ChatGPT-generated programs implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time at which tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 programs are deemed correct, 1,082 programs provide wrong outputs, and 177 programs contain compilation or runtime errors. Additionally, we analyze other characteristics of the generated code through static analysis tools, such as code style and maintainability, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT’s self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement.
Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of artificial intelligence models such as ChatGPT.

Funder

National Research Foundation

Australian Research Council’s Discovery Early Career Researcher Award

Australian Government through the Australian Research Council’s Discovery Early Career Researcher Award

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3643674

References: 63 articles.

1. Few-shot training LLMs for project-specific code-summarization

2. Amazon. 2023. Amazon CodeWhisperer. Retrieved from https://aws.amazon.com/codewhisperer/

3. Program synthesis with large language models;Austin Jacob;arXiv preprint arXiv:2108.07732,2021

4. Anonymous. 2023. Replication Package for Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. Retrieved from https://github.com/yueyueL/ChatGPT-CodeGenAnalysis

5. How android app developers manage power consumption?

Cited by 14 articles.

1. When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention;Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis;2024-09-11

2. The Current State of Generative Artificial Intelligence Tools for Accessibility in Product Development;Nafath;2024-07-30

3. Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs;Proceedings of the 1st ACM International Conference on AI-Powered Software;2024-07-10

4. Students' Perspectives on AI Code Completion: Benefits and Challenges;2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC);2024-07-02

5. LUNA: A Model-Based Universal Analysis Framework for Large Language Models;IEEE Transactions on Software Engineering;2024-07