PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models
Published: 2024-07-12
Issue: FSE
Volume: 1
Pages: 1194-1215
ISSN: 2994-970X
Container-title: Proceedings of the ACM on Software Engineering
Language: en
Short-container-title: Proc. ACM Softw. Eng.
Authors:
Simin Chen (1), XiaoNing Feng (2), Xiaohong Han (2), Cong Liu (3), Wei Yang (1)
Affiliations:
1. University of Texas at Dallas, USA; 2. Taiyuan University of Technology, Taiyuan, China; 3. University of California, Riverside, USA
Abstract
In recent years, a plethora of Large Code Generation Models (LCGMs) have been proposed, showing significant potential for assisting developers with complex programming tasks. Amid this surge of LCGM proposals, a critical aspect of code generation research is effectively benchmarking the programming capabilities of these models.
Benchmarking LCGMs requires a set of diverse programming problems, each comprising a prompt (including the task description), a canonical solution, and test inputs. Existing methods for constructing such a problem set fall into two main categories: manual methods and perturbation-based methods. However,
manual methods demand substantial human effort, do not scale, and struggle to maintain long-term data integrity because of LCGMs' potentially contaminated training-data collection; perturbation-based approaches mainly generate semantically homogeneous problems that share the seed problem's canonical solution and introduce typos that an IDE can easily auto-correct, making them ineffective and unrealistic.
Addressing these limitations presents several challenges: (1) how to automatically generate semantically diverse Canonical Solutions to enable comprehensive benchmarking of the models, (2) how to ensure long-term data integrity to prevent data contamination, and (3) how to generate natural and realistic programming problems.
To tackle the first challenge, we draw a key insight from viewing a program as a series of mappings from the input domain to the output domain. These mappings can be transformed, split, reordered, or merged to construct new programs. Based on this insight, we propose programming problem merging, where two existing programming problems are combined to create new ones.
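As a rough illustration of problem merging (a minimal sketch assuming Python seed problems; the function names, prompt, and composition-style merge below are our own illustrative choices, not the tool's actual implementation), two type-compatible seed solutions can be chained so that the merged problem's canonical solution differs semantically from either seed:

# Illustrative sketch: merging two seed problems by composing their
# canonical solutions. All names and prompts here are hypothetical.

def solution_a(nums: list) -> int:
    """Seed problem A: return the sum of a list of integers."""
    return sum(nums)

def solution_b(n: int) -> bool:
    """Seed problem B: return True if an integer is even."""
    return n % 2 == 0

def merged_solution(nums: list) -> bool:
    """Merged problem: feed A's output into B, yielding a canonical
    solution that neither seed problem contains."""
    return solution_b(solution_a(nums))

merged_prompt = ("Write a function that returns True if the sum of a list "
                 "of integers is even, and False otherwise.")

assert merged_solution([1, 3]) is True   # 1 + 3 = 4, even
assert merged_solution([1, 2]) is False  # 1 + 2 = 3, odd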
In addressing the second challenge, we incorporate randomness into our programming problem-generation process. Our tool can probabilistically guarantee that no problem is repeated across two random trials.
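One way to picture this guarantee (a hypothetical sketch; the pool sizes, transformation names, and probability estimate below are assumptions for illustration, not the tool's actual configuration space) is to draw each generated problem's configuration uniformly from a large random space, so two independent trials coincide only with negligible probability:

import random

# Hypothetical sketch: sample a merge configuration from a large space so
# two independent generation trials almost never produce the same problem.

SEED_PROBLEMS = [f"seed_{i}" for i in range(164)]       # e.g., a HumanEval-sized pool
TRANSFORMS = ["negate", "offset", "scale", "identity"]   # illustrative operators

def sample_problem(rng: random.Random) -> tuple:
    """Randomly pick two distinct seeds, one transform, and one constant."""
    seed_a, seed_b = rng.sample(SEED_PROBLEMS, 2)
    op = rng.choice(TRANSFORMS)
    const = rng.randint(-1000, 1000)
    return (seed_a, seed_b, op, const)

# The configuration space holds 164 * 163 * 4 * 2001 (about 2.1e8) distinct
# points, so two independent trials repeat a problem with probability ~1/2.1e8.
trial_1 = sample_problem(random.Random())
trial_2 = sample_problem(random.Random())
print(trial_1 != trial_2)  # True with overwhelming probability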
To tackle the third challenge, we propose the concept of a Lambda Programming Problem, comprising a concise one-sentence task description in natural language accompanied by a corresponding program implementation. Our tool ensures the program prompt is grammatically correct. Additionally, the tool leverages return value type analysis to verify the correctness of newly created Canonical Solutions.
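A Lambda Programming Problem might be represented as in the sketch below (the class name, fields, and return-type check are our illustrative reading of this description, not the tool's code): a one-sentence natural-language description paired with the lambda that implements it, plus a lightweight check that a newly created canonical solution returns the declared type:

from dataclasses import dataclass
from typing import Callable

# Hypothetical representation of a Lambda Programming Problem: a one-sentence
# task description plus the lambda implementing it. Names are assumptions.

@dataclass
class LambdaProblem:
    description: str          # concise one-sentence task description
    implementation: Callable  # corresponding program implementation
    return_type: type         # declared return type, used for verification

def check_return_type(problem: LambdaProblem, sample_input) -> bool:
    """Simple return-value type analysis: run the candidate canonical
    solution on a sample input and check the declared return type."""
    return isinstance(problem.implementation(sample_input), problem.return_type)

double_check = LambdaProblem(
    description="Return True if twice the given integer is greater than 10.",
    implementation=lambda n: 2 * n > 10,
    return_type=bool,
)

print(check_return_type(double_check, 7))  # True: the solution returns a bool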
In our empirical evaluation, we apply our tool to two widely used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems compared to the baselines.
Publisher
Association for Computing Machinery (ACM)