Distributed Representation for Assembly Code

Author:

Yoshida Kazuki (1), Suzuki Kaiyu (1), Matsuzawa Tomofumi (1)

Affiliation:

1. Department of Information Sciences, Tokyo University of Science, Yamazaki, Chiba 278-8510, Japan

Abstract

In recent years, the number of similar software products with many common parts has been increasing due to the reuse and plagiarism of source code in the software development process. Pattern matching, an existing method for detecting similarity, cannot detect the similarities between these software products and other programs. It is necessary, for example, to detect similarities based on commonalities in both functionality and control structures. At the same time, detailed software analysis requires manual reverse engineering. Therefore, technologies that automatically identify, in advance, similarities among the large amounts of code present in software products can reduce this workload. In this paper, we propose a representation learning model that extracts feature expressions from assembly code obtained by statically analyzing such code, in order to determine the similarity between software products. We use assembly code to eliminate the dependence on the existence of source code and on differences in development language. The proposed approach makes use of Asm2Vec, an existing method capable of generating a vector representation that captures the semantics of assembly code. The proposed method also incorporates information on the program control structure. The control structure can be represented as graph data. Thus, we use graph embedding, a method for representing graphs as vectors, to generate a representation vector that reflects both the semantics and the control structure of the assembly code. In our experiments, we generated expression vectors from multiple programs and used clustering to verify the accuracy of the approach in classifying similar programs into the same cluster. The proposed method outperforms existing methods that consider only semantics in both accuracy and execution time.
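
The idea of a vector that reflects both semantics and control structure can be illustrated with a minimal sketch. This is not the authors' implementation: the semantic vectors below are hand-written placeholders standing in for Asm2Vec output, the out-degree histogram is a deliberately crude stand-in for a graph embedding of the control-flow graph, and all function names (`degree_histogram`, `combine`, `cosine`) are hypothetical. The two parts are normalized and concatenated, and programs are then compared by cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def degree_histogram(cfg, max_degree=3):
    """ASSUMPTION: a crude stand-in for a graph embedding.
    cfg maps each basic block to its list of successor blocks; the result
    counts how many blocks have 0, 1, 2, or >= max_degree successors."""
    hist = [0] * (max_degree + 1)
    for successors in cfg.values():
        hist[min(len(successors), max_degree)] += 1
    return [float(count) for count in hist]

def combine(semantic_vec, cfg):
    """Concatenate the normalized semantic part and the normalized structural part."""
    return l2_normalize(semantic_vec) + l2_normalize(degree_histogram(cfg))

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

# Placeholder semantic vectors (hypothetical values, not real Asm2Vec output).
sem_a, sem_b, sem_c = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.2, 0.9]

# Toy control-flow graphs: a and b are simple loops, c is a three-way branch.
cfg_a = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
cfg_b = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
cfg_c = {"entry": ["a", "b", "c"], "a": ["exit"], "b": ["exit"], "c": ["exit"], "exit": []}

vec_a = combine(sem_a, cfg_a)
vec_b = combine(sem_b, cfg_b)
vec_c = combine(sem_c, cfg_c)

print("sim(a, b) =", round(cosine(vec_a, vec_b), 3))
print("sim(a, c) =", round(cosine(vec_a, vec_c), 3))
```

In this toy setting the two loop-shaped programs score as more similar to each other than to the branching program; a clustering step, as described in the abstract, would operate on such combined vectors.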

Publisher

MDPI AG

Subject

Computer Networks and Communications, Human-Computer Interaction

