Affiliation:
1. Security Research Centre, Concordia University, Montreal, Quebec, Canada
Abstract
Identifying free open-source software (FOSS) packages in binaries when the source code is unavailable is important for many security applications, such as malware detection, software infringement detection, and digital forensics. This capability improves both the accuracy and the efficiency of reverse engineering tasks by avoiding false correlations between irrelevant code bases. Although the FOSS package identification problem belongs to the field of software engineering, conventional approaches rely heavily on practical methods from data mining and database searching. However, various challenges in the use of these methods prevent existing function identification approaches from being effective in the absence of source code. To make matters worse, obfuscation techniques, different compilers and compilation settings, and software refactoring have made the automated detection of FOSS packages increasingly difficult. With very few exceptions, existing systems are not resilient to such techniques, and the exceptions are not sufficiently efficient.
To address this issue, we propose FOSSIL, a novel resilient and efficient system that incorporates three components. The first component extracts the syntactical features of functions by considering opcode frequencies and applying a hidden Markov model statistical test. The second component applies a neighborhood hash graph kernel to random walks derived from control-flow graphs, with the goal of extracting the semantics of the functions. The third component applies z-score to the normalized instructions to extract the behavior of instructions in a function. The components are integrated using a Bayesian network model, which synthesizes the results to determine the FOSS function. The novel approach of combining these components using the Bayesian network has produced stronger resilience to code obfuscation.
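As a rough illustration of two of the feature ideas mentioned above (opcode frequencies and z-scores), the following minimal sketch computes the relative frequency of each opcode in a function and scores it against a reference population. It is not the FOSSIL implementation: the function names `opcode_frequencies` and `z_scores`, the text-based instruction format, and the choice of population are assumptions made for illustration, and the instruction normalization step is omitted.

```python
from collections import Counter
from statistics import mean, pstdev


def opcode_frequencies(instructions):
    """Relative frequency of each opcode (first token) in a function.

    `instructions` is a list of disassembly strings such as "mov eax, ebx".
    This input format is an assumption for the sketch.
    """
    counts = Counter(ins.split()[0] for ins in instructions)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def z_scores(freqs, population):
    """z-score of each opcode frequency against a reference population.

    `population` is a list of frequency dicts from other functions.
    Opcodes with zero variance in the population get a score of 0.0.
    """
    scores = {}
    for op, f in freqs.items():
        values = [p.get(op, 0.0) for p in population]
        mu, sigma = mean(values), pstdev(values)
        scores[op] = 0.0 if sigma == 0 else (f - mu) / sigma
    return scores


# Hypothetical target function and reference population.
target = ["mov eax, ebx", "add eax, 1", "mov ecx, eax", "ret"]
reference = [{"mov": 0.4, "ret": 0.1}, {"mov": 0.6, "ret": 0.3}]

freqs = opcode_frequencies(target)
scores = z_scores(freqs, reference)
```

A score far from zero marks an opcode that the function uses unusually often or rarely relative to the population, which is the kind of signal a downstream model could combine with other features.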
We evaluate our system on three datasets, including real-world projects whose use of FOSS packages is known, malware binaries for which there are security and reverse engineering reports purporting to describe their use of FOSS, and a large repository of malware binaries. We demonstrate that our system is able to identify FOSS packages in real-world projects with a mean precision of 0.95 and with a mean recall of 0.85. Furthermore, FOSSIL is able to discover FOSS packages in malware binaries that match those listed in security and reverse engineering reports. Our results show that modern malware binaries contain 0.10–0.45 of FOSS packages.
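For readers unfamiliar with the metrics, precision and recall for package identification can be computed per project as in the sketch below. The helper name `precision_recall` and the package names are hypothetical; the numbers are illustrative only and are not results from the paper.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a predicted package set against ground truth."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: packages reported for one binary vs. those known to be used.
reported = {"zlib", "libpng", "sqlite"}
known = {"zlib", "libpng", "openssl", "lua"}

precision, recall = precision_recall(reported, known)
```

Mean precision and mean recall are then the averages of these per-project values over the dataset.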
Publisher
Association for Computing Machinery (ACM)
Subject
Safety, Risk, Reliability and Quality, General Computer Science
References: 88 articles.
Cited by
38 articles.
1. CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection;Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis;2024-09-11
2. BinCodex: A comprehensive and multi-level dataset for evaluating binary code similarity detection techniques;BenchCouncil Transactions on Benchmarks, Standards and Evaluations;2024-06
3. Identifying Authorship in Malicious Binaries: Features, Challenges & Datasets;ACM Computing Surveys;2024-04-30
4. Broad learning: A GPU-free image-based malware classification;Applied Soft Computing;2024-03
5. AdvBinSD: Poisoning the Binary Code Similarity Detector via Isolated Instruction Sequences;2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom);2023-12-21