Towards AI-Assisted Synthesis of Verified Dafny Methods

Published: 2024-07-12 Issue: FSE Volume: 1 Page: 812-835
ISSN: 2994-970X
Container-title: Proceedings of the ACM on Software Engineering
Language: en
Short-container-title: Proc. ACM Softw. Eng.

Author:

Misu, Md Rakib Hossain¹; Lopes, Cristina V.¹; Ma, Iris¹; Noble, James²

Affiliation:

1. University of California, Irvine, Irvine, USA

2. Creative Research & Programming, Wellington, New Zealand / Australian National University, Canberra, Australia

Abstract

Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs’ specifications as well as code, so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models’ proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases; and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes example problems and solutions supplied through retrieval augmentation. Our results show that GPT-4 performs better than PaLM-2 on these tasks and that both models perform best with the retrieval-augmented CoT prompt. With that prompt, GPT-4 was able to generate verified, human-evaluated Dafny methods for 58% of the problems; however, it managed only 19% of the problems with the Contextless prompt, and even fewer (10%) with the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems: 50 that we wrote manually and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code-generating large language models.
Likewise, program verification systems can benefit from large language models, whether to synthesize code wholesale, to generate specifications, or to act as a "programmer’s verification apprentice" that constructs annotations, such as loop invariants, which are hard for programmers to write or for verification tools to find. Finally, we expect that the approach we have pioneered here, generating candidate solutions that are subsequently formally checked for correctness, should transfer to other domains (e.g., legal arguments, transport signaling, structural engineering) where solutions must be correct and where that correctness must be demonstrated, explained, and understood by designers and end-users.
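
The generate-then-check approach described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the prompt wording, the function names `build_prompt` and `synthesize`, the retry limit, and the pluggable `generate` and `verify` callables are all assumptions introduced here. In practice `generate` would call a language model and `verify` would run the Dafny verifier on the candidate.

```python
from typing import Callable, Optional


def build_prompt(style: str, task: str, signature: str = "",
                 tests: str = "", examples: str = "") -> str:
    """Assemble a prompt in one of the three styles named in the abstract.

    The exact wording is hypothetical; only the ingredients of each style
    (task only, signature plus tests, steps plus retrieved examples) come
    from the abstract.
    """
    if style == "contextless":
        return f"Write a Dafny method for this task:\n{task}"
    if style == "signature":
        return (f"Task:\n{task}\n"
                f"Method signature:\n{signature}\n"
                f"Test cases:\n{tests}")
    if style == "cot":
        return (f"Worked examples:\n{examples}\n"
                f"Task:\n{task}\n"
                "Decompose the problem into steps, then write the "
                "specification and the method.")
    raise ValueError(f"unknown prompt style: {style}")


def synthesize(prompt: str,
               generate: Callable[[str], str],
               verify: Callable[[str], bool],
               max_attempts: int = 3) -> Optional[str]:
    """Return the first candidate accepted by the verifier, or None.

    `generate` and `verify` are stand-ins (assumptions) for a model call
    and a formal verification run; `max_attempts` is an arbitrary limit.
    """
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if verify(candidate):
            return candidate
    return None
```

The key design point is that the verifier, not the model, decides whether a candidate is accepted, so an erroneous candidate is rejected rather than returned.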

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3643763

Cited by 2 articles.

1. Investigating large language models capabilities for automatic code repair in Python. Cluster Computing, 2024-05-09.

2. Clover: Closed-Loop Verifiable Code Generation. Lecture Notes in Computer Science, 2024.