Affiliation:
1. Department of Information Sciences, Tokyo University of Science, Yamazaki, Chiba 278-8510, Japan
Abstract
In recent years, the number of similar software products that share many common parts has been increasing because source code is reused and plagiarized during software development. Pattern matching, an existing method for detecting similarity, cannot detect the similarities between these software products and other programs. It is necessary, for example, to detect similarities based on commonalities in both functionality and control structure. At the same time, detailed software analysis requires manual reverse engineering. Technologies that automatically identify, in advance, similarities among the large amounts of code present in software products can therefore reduce this load. In this paper, we propose a representation learning model that extracts feature representations from assembly code obtained by static analysis in order to determine the similarity between software products. We use assembly code to remove any dependence on the availability of source code or on differences in development language. The proposed approach builds on Asm2Vec, an existing method that generates a vector representation capturing the semantics of assembly code. The proposed method also incorporates information on the program control structure. Because the control structure can be represented as graph data, we use graph embedding, a method for representing graphs as vectors, to generate a representation vector that reflects both the semantics and the control structure of the assembly code. In our experiments, we generated representation vectors from multiple programs and used clustering to verify how accurately the approach classifies similar programs into the same cluster. The proposed method outperforms existing methods that consider only semantics in both accuracy and execution time.
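The idea of combining semantic vectors with control structure can be illustrated with a small sketch. It is not the model proposed in the paper: the per-block vectors are made-up toy values standing in for Asm2Vec output, and one round of neighbor averaging followed by mean pooling is an assumed, deliberately simple substitute for a real graph embedding. The function names `neighbor_smooth`, `program_vector`, and `cosine` are hypothetical.

```python
import math

def neighbor_smooth(vectors, edges):
    """Average each block vector with the vectors of its graph neighbors.

    vectors: dict mapping block name -> list of floats (toy semantic vectors)
    edges:   list of (src, dst) control-flow edges, treated as undirected here
    """
    neighbors = {node: [] for node in vectors}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    smoothed = {}
    for node, vec in vectors.items():
        group = [vec] + [vectors[n] for n in neighbors[node]]
        smoothed[node] = [sum(col) / len(group) for col in zip(*group)]
    return smoothed

def program_vector(vectors, edges):
    """Mean-pool the smoothed block vectors into one program-level vector."""
    rows = list(neighbor_smooth(vectors, edges).values())
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy data (assumed): three basic blocks with identical semantics in both
# programs, but wired together differently.
blocks = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 1.0]}
v_chain = program_vector(blocks, [("A", "B"), ("B", "C")])  # A - B - C
v_hub = program_vector(blocks, [("A", "B"), ("A", "C")])    # A is a hub

print(v_chain, v_hub, cosine(v_chain, v_hub))
```

Both toy programs contain the same blocks, so a representation based on semantics alone would treat them as identical. Once the edges are taken into account, the two program vectors differ, which is the kind of distinction a structure-aware representation is meant to capture before clustering.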
Subject
Computer Networks and Communications, Human-Computer Interaction