Exploring the Potential of Pre-Trained Language Models of Code for Automated Program Repair
Published: 2024-03-25
Issue: 7
Volume: 13
Page: 1200
ISSN: 2079-9292
Container-title: Electronics
Language: en
Author:
Hao Sichong 1, Shi Xianjun 1, Liu Hongwei 1 (ORCID)
Affiliation:
1. Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China
Abstract
In the realm of software development, automated program repair (APR) emerges as a pivotal technique, autonomously debugging faulty code to boost productivity. Despite the notable advancements of large pre-trained language models of code (PLMCs) in code generation, their efficacy in complex tasks like APR remains suboptimal. This limitation is attributed to the generic development of PLMCs, whose specialized potential for APR has yet to be fully explored. In this paper, we propose a novel approach designed to enhance PLMCs’ APR performance through source code augmentation and curriculum learning. Our approach employs code augmentation operators to generate a spectrum of syntactically varied yet semantically congruent bug-fixing programs, thus enriching the dataset’s diversity. Furthermore, we design a curriculum learning strategy that enables PLMCs to develop a deep understanding of program semantics from these enriched code variants, thereby improving their APR fine-tuning. We apply our approach across different PLMCs and systematically evaluate it on three benchmarks: BFP-small, BFP-medium, and Defects4J. The experimental results show that our approach outperforms both the original models and existing baseline methods, demonstrating the promising future of adapting PLMCs for code debugging in practice.
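To make the abstract's two ideas concrete, the minimal Python sketch below illustrates (a) a semantics-preserving code augmentation operator and (b) a simple curriculum ordering of bug-fixing pairs. The paper does not specify its operators or difficulty metric here, so the variable-renaming operator, the token-length difficulty proxy, and the helper names (rename_variables, curriculum_order) are illustrative assumptions rather than the authors' actual design.

```python
# Minimal sketch of source code augmentation + curriculum ordering for APR data.
# Assumptions (not from the paper): variable renaming as the augmentation
# operator and buggy-program token length as the curriculum difficulty proxy.
import random
import re


def rename_variables(java_snippet: str, seed: int = 0) -> str:
    """Produce a syntactically different but semantically equivalent variant
    by consistently renaming simple lowercase identifiers (illustrative only)."""
    rng = random.Random(seed)
    idents = sorted(set(re.findall(r"\b[a-z][A-Za-z0-9_]*\b", java_snippet)))
    keywords = {"int", "return", "if", "else", "for", "while",
                "new", "null", "true", "false"}
    mapping = {
        name: f"v{idx}_{rng.randint(0, 9)}"
        for idx, name in enumerate(n for n in idents if n not in keywords)
    }
    return re.sub(
        r"\b[a-z][A-Za-z0-9_]*\b",
        lambda m: mapping.get(m.group(0), m.group(0)),
        java_snippet,
    )


def curriculum_order(pairs: list) -> list:
    """Order (buggy, fixed) pairs from 'easy' to 'hard' using a simple proxy:
    the token length of the buggy program."""
    return sorted(pairs, key=lambda pair: len(pair[0].split()))


if __name__ == "__main__":
    buggy = "int add(int a, int b) { return a - b; }"
    fixed = "int add(int a, int b) { return a + b; }"
    # Augment one bug-fixing pair into several semantically congruent variants,
    # then feed them to the model in easy-to-hard order.
    augmented = [(rename_variables(buggy, s), rename_variables(fixed, s))
                 for s in range(3)]
    for buggy_variant, fixed_variant in curriculum_order(augmented + [(buggy, fixed)]):
        print(buggy_variant, "->", fixed_variant)
```

Using the same seed for the buggy and fixed programs keeps their identifier mappings aligned, so each augmented pair still describes the same bug fix; in a full pipeline the fine-tuning loop would simply consume the pairs in the curriculum's order.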
Funder
National Key Research and Development Program of China