Revealing the Unseen: AI Chain on LLMs for Predicting Implicit Data Flows to Generate Data Flow Graphs in Dynamically-Typed Code

Published: 2024-06-12
ISSN: 1049-331X
Container-title: ACM Transactions on Software Engineering and Methodology
Language: en
Short-container-title: ACM Trans. Softw. Eng. Methodol.

Author:

Huang Qing¹, Luo Zhiwen¹, Xing Zhenchang², Zeng Jinshan¹, Chen Jieshan³, Xu Xiwei³, Chen Yong¹

Affiliation:

1. Jiangxi Normal University, School of Computer Information Engineering, China

2. CSIRO’s Data61 & Australian National University, College of Engineering and Computer Science, Australia

3. CSIRO’s Data61, Australia

Abstract

Data flow graphs (DFGs) capture definitions (defs) and uses across program blocks and are a fundamental program representation for program analysis, testing, and maintenance. However, dynamically-typed programming languages like Python present implicit data flow issues that make it challenging to determine def-use flow information at compile time. Static analysis tools like Soot and WALA are inadequate for handling these issues, and manually enumerating comprehensive heuristic rules is impractical. Large pre-trained language models (LLMs) offer a potential solution: their language understanding and pattern matching abilities allow them to predict implicit data flow by analyzing code context and the relationships between variables, functions, and statements. We propose leveraging LLMs’ in-context learning ability to learn implicit rules and patterns from code representation and contextual information to solve implicit data flow problems. To further enhance the accuracy of LLMs, we design a five-step Chain of Thought (CoT) and break it down into an AI chain, with each step corresponding to a separate AI unit, to generate accurate DFGs for Python code. We thoroughly assess our approach’s performance, demonstrating the effectiveness of each AI unit in the AI chain. Compared to static analysis, our method achieves 82% higher def coverage and 58% higher use coverage in DFG generation on implicit data flow. We also show that each unit in the AI chain is indispensable. Overall, our approach offers a promising direction for building software engineering tools on foundation models: rather than investing significant engineering and maintenance effort, developers can focus on identifying problems for AI to solve.
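To make the notion of an implicit data flow concrete, the following is a minimal illustrative sketch, not the paper's AI chain and not how Soot or WALA work. It assumes a deliberately naive syntactic def-use pass built on Python's standard `ast` module; the snippet, the helper `explicit_defs_and_uses`, and the variable `threshold` are all hypothetical names chosen for illustration. The pass sees the use of `threshold` but finds no matching def, because the value is bound through `globals()` rather than through an assignment target.

```python
import ast
import builtins

# Hypothetical snippet: `threshold` is defined implicitly through globals(),
# so no assignment statement names it as a target.
SNIPPET = '''
globals()["threshold"] = 10
limit = threshold + 1
'''

def explicit_defs_and_uses(source):
    """Collect names bound as plain assignment targets (defs) and names read (uses)."""
    defs, uses = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                uses.add(node.id)
    return defs, uses

defs, uses = explicit_defs_and_uses(SNIPPET)

# Uses with no syntactically visible def, ignoring builtins such as `globals`.
unresolved = uses - defs - set(dir(builtins))
print("defs:", sorted(defs))
print("unresolved uses:", sorted(unresolved))
```

Under these assumptions, the def of `threshold` is invisible to a purely syntactic pass, which is the kind of gap in def coverage the abstract attributes to implicit data flows in dynamically-typed code.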

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3672458
