Exploring the Potential of Pre-Trained Language Models of Code for Automated Program Repair
Published: 2024-03-25
Issue: 7
Volume: 13
Page: 1200
ISSN: 2079-9292
Container-title: Electronics
Language: en
Author:
Hao Sichong 1, Shi Xianjun 1, Liu Hongwei 1 (ORCID)
Affiliation:
1. Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China
Abstract
In the realm of software development, automated program repair (APR) emerges as a pivotal technique, autonomously debugging faulty code to boost productivity. Despite the notable advancements of large pre-trained language models of code (PLMCs) in code generation, their efficacy in complex tasks like APR remains suboptimal. This limitation is attributed to the generic development of PLMCs, whose specialized potential for APR has yet to be fully explored. In this paper, we propose a novel approach designed to enhance PLMCs’ APR performance through source code augmentation and curriculum learning. Our approach employs code augmentation operators to generate a spectrum of syntactically varied yet semantically congruent bug-fixing programs, thus enriching the dataset’s diversity. Furthermore, we design a curriculum learning strategy that enables PLMCs to develop a deep understanding of program semantics from these enriched code variants, thereby improving their APR fine-tuning. We apply our approach across different PLMCs and systematically evaluate it on three benchmarks: BFP-small, BFP-medium, and Defects4J. The experimental results show that our approach outperforms both the original models and existing baseline methods, demonstrating the promising future of adapting PLMCs for code debugging in practice.
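To make the abstract's two ideas concrete, the minimal Python sketch below illustrates (a) a semantics-preserving code augmentation operator and (b) a simple curriculum ordering of bug-fixing pairs. The paper does not specify its operators or difficulty metric here, so the variable-renaming operator, the token-length difficulty proxy, and the helper names (rename_variables, curriculum_order) are illustrative assumptions rather than the authors' actual design.

```python
# Minimal sketch of source code augmentation + curriculum ordering for APR data.
# Assumptions (not from the paper): variable renaming as the augmentation
# operator and buggy-program token length as the curriculum difficulty proxy.
import random
import re


def rename_variables(java_snippet: str, seed: int = 0) -> str:
    """Produce a syntactically different but semantically equivalent variant
    by consistently renaming simple lowercase identifiers (illustrative only)."""
    rng = random.Random(seed)
    idents = sorted(set(re.findall(r"\b[a-z][A-Za-z0-9_]*\b", java_snippet)))
    keywords = {"int", "return", "if", "else", "for", "while",
                "new", "null", "true", "false"}
    mapping = {
        name: f"v{idx}_{rng.randint(0, 9)}"
        for idx, name in enumerate(n for n in idents if n not in keywords)
    }
    return re.sub(
        r"\b[a-z][A-Za-z0-9_]*\b",
        lambda m: mapping.get(m.group(0), m.group(0)),
        java_snippet,
    )


def curriculum_order(pairs: list) -> list:
    """Order (buggy, fixed) pairs from 'easy' to 'hard' using a simple proxy:
    the token length of the buggy program."""
    return sorted(pairs, key=lambda pair: len(pair[0].split()))


if __name__ == "__main__":
    buggy = "int add(int a, int b) { return a - b; }"
    fixed = "int add(int a, int b) { return a + b; }"
    # Augment one bug-fixing pair into several semantically congruent variants,
    # then feed them to the model in easy-to-hard order.
    augmented = [(rename_variables(buggy, s), rename_variables(fixed, s))
                 for s in range(3)]
    for buggy_variant, fixed_variant in curriculum_order(augmented + [(buggy, fixed)]):
        print(buggy_variant, "->", fixed_variant)
```

Using the same seed for the buggy and fixed programs keeps their identifier mappings aligned, so each augmented pair still describes the same bug fix; in a full pipeline the fine-tuning loop would simply consume the pairs in the curriculum's order.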
Funder
National Key Research and Development Program of China