FUSION: Measuring Binary Function Similarity with Code-Specific Embedding and Order-Sensitive GNN
Author:
Gao Hao, Zhang Tong, Chen Songqiang, Wang Lina, Yu Fajiang
Abstract
Binary code similarity measurement is a popular research area in binary analysis, driven by the recent development of deep learning-based models. Current state-of-the-art methods often use a pre-trained language model (PTLM) to embed the instructions of each basic block, and use the resulting vectors as node representations within a control flow graph (CFG). These methods then use a graph neural network (GNN) to embed the whole CFG and measure binary similarity between the resulting code embeddings. However, they treat assembly code almost directly as natural language text and ignore its code-specific features when training the PTLM. Moreover, they either barely consider the direction of edges in the CFG or handle it inefficiently. These weaknesses may limit the performance of previous methods. In this paper, we propose a novel method called function similarity using code-specific PPTs and order-sensitive GNN (FUSION). Since the similarity of binary codes is a symmetric/asymmetric problem, we were guided by the ideas of symmetry and asymmetry in our research. FUSION measures binary function similarity with two code-specific PTLM training strategies and an order-sensitive GNN, which respectively alleviate the aforementioned weaknesses. FUSION outperforms the state-of-the-art binary similarity methods by up to 5.4% in accuracy, and performs significantly better.
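The pipeline described in the abstract (block embeddings as CFG nodes, a GNN pass over the graph, then a similarity score between function embeddings) can be illustrated with a minimal sketch. This is not FUSION's architecture: the scalar weights `w_in` and `w_out`, the `tanh` update, the sum pooling, and all function names are hypothetical choices made only to show what sensitivity to edge direction means. The input vectors stand in for the block embeddings that a PTLM would produce.

```python
import math


def propagate(features, edges, w_in=0.7, w_out=0.3, rounds=2):
    """Direction-aware message passing over a CFG (illustrative only).

    features: one vector per basic block (stand-ins for PTLM embeddings).
    edges:    directed CFG edges as (src, dst) index pairs.
    w_in, w_out: hypothetical scalar weights applied to messages from
                 predecessors and from successors, respectively.
    """
    dim = len(features[0])
    h = [list(v) for v in features]
    for _ in range(rounds):
        incoming = [[0.0] * dim for _ in h]  # sums over predecessors
        outgoing = [[0.0] * dim for _ in h]  # sums over successors
        for src, dst in edges:
            for k in range(dim):
                incoming[dst][k] += h[src][k]
                outgoing[src][k] += h[dst][k]
        h = [
            [
                math.tanh(h[i][k] + w_in * incoming[i][k] + w_out * outgoing[i][k])
                for k in range(dim)
            ]
            for i in range(len(h))
        ]
    return h


def embed_graph(features, edges, **kwargs):
    """Sum-pool the propagated node vectors into one function embedding."""
    h = propagate(features, edges, **kwargs)
    return [sum(v[k] for v in h) for k in range(len(h[0]))]


def cosine(a, b):
    """Cosine similarity between two embeddings (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


if __name__ == "__main__":
    blocks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy block embeddings
    forward = [(0, 1), (1, 2)]                     # block 0 -> 1 -> 2
    backward = [(dst, src) for src, dst in forward]  # same graph, edges flipped
    f = embed_graph(blocks, forward)
    b = embed_graph(blocks, backward)
    print("similarity of a CFG and its edge-reversed copy:", cosine(f, b))
```

When `w_in` equals `w_out`, the update treats predecessors and successors identically, so a CFG and its edge-reversed copy receive the same embedding. Using distinct weights is one simple way to make the two graphs distinguishable, which is the kind of information a direction-blind GNN discards.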
Funder
National Natural Science Foundation of China; National Key R&D Program of China
Subject
Physics and Astronomy (miscellaneous), General Mathematics, Chemistry (miscellaneous), Computer Science (miscellaneous)
Cited by
2 articles.
1. Python Open-Source Code Traceability Model Based on Graph Neural Networks; 2023 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech); 2023-11-14
2. Research on Fault Diagnosis Based on Wide Narrow Convolutions Network; 2023 3rd New Energy and Energy Storage System Control Summit Forum (NEESSC); 2023-09-26