On the Reliability and Explainability of Language Models for Program Generation

Published: 2024-06-03 Issue: 5 Volume: 33 Page: 1-26
ISSN: 1049-331X
Container-title: ACM Transactions on Software Engineering and Methodology
Language: en
Short-container-title: ACM Trans. Softw. Eng. Methodol.

Author:

Liu Yue¹, Tantithamthavorn Chakkrit¹, Liu Yonghui¹, Li Li²

Affiliation:

1. Monash University, Clayton, Australia

2. Beihang University, Beijing, China

Abstract

Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, there is still uncertainty about the reliability of these models, particularly their realistic ability to consistently transform code sequences. This raises a question: are these techniques sufficiently trustworthy for automated program generation? Consequently, further research is needed to understand model logic and assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation. We discover that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, causing overoptimistic results. Our explainability analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhance the reliability and explainability of automated program generation moving forward. Our findings provide important guidelines for this goal.
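As a rough illustration of the data duplication issue described above, the following sketch measures what fraction of test samples also occur in the training split once whitespace differences are ignored. It is not the authors' procedure: the helper names `normalize` and `duplication_rate` and the toy samples are assumptions made purely for illustration.

```python
import re


def normalize(code: str) -> str:
    """Collapse whitespace runs so formatting-only differences are ignored."""
    return re.sub(r"\s+", " ", code).strip()


def duplication_rate(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose normalized text also occurs in train."""
    if not test:
        return 0.0
    seen = {normalize(sample) for sample in train}
    leaked = sum(1 for sample in test if normalize(sample) in seen)
    return leaked / len(test)


# Toy data (hypothetical): one of the two test samples repeats a training
# sample, differing only in whitespace and line breaks.
train = ["int add(int a, int b) { return a + b; }", "void f() {}"]
test = ["int add(int a,  int b) {\n  return a + b;\n}", "int g() { return 1; }"]
rate = duplication_rate(train, test)
```

A high rate on a real benchmark would suggest that reported scores partly reflect samples already seen during training rather than the model's ability to generalize.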

Funder

Australian Research Council’s Discovery Early Career Researcher Award

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3641540

Cited by 5 articles.

1. Automatically Recommend Code Updates: Are We There Yet?;ACM Transactions on Software Engineering and Methodology;2024-07-16

2. Toward a Theory of Causation for Interpreting Neural Code Models;IEEE Transactions on Software Engineering;2024-05

3. Automated code development based on genetic programming in graphical programming language: A pilot study;PLOS ONE;2024-03-07

4. Using model-driven engineering to automate software language translation;Automated Software Engineering;2024-02-28

5. An Approach for Rapid Source Code Development Based on ChatGPT and Prompt Engineering;IEEE Access;2024