Going beyond API Calls in Dynamic Malware Analysis: A Novel Dataset

Published: 2024-09-06
Volume: 13
Issue: 17
Page: 3553
ISSN: 2079-9292
Journal: Electronics
Language: en

Authors:

Ilić Slaviša¹², Gnjatović Milan², Tot Ivan¹, Jovanović Boriša¹, Maček Nemanja³, Gavrilović Božović Marijana⁴

Affiliation:

1. Department of Military Electronic Engineering, University of Defence, Veljka Lukića Kurjaka 1, 11000 Belgrade, Serbia

2. Department of Information Technology, University of Criminal Investigation and Police Studies, Cara Dušana 196, 11080 Beograd, Serbia

3. School of Electrical and Computer Engineering, Academy of Technical and Art Applied Studies, Vojvode Stepe 283, 11000 Beograd, Serbia

4. Faculty of Engineering, University of Kragujevac, Sestre Janjić 6, 34000 Kragujevac, Serbia

Abstract

Automated sandbox-based analysis systems predominantly focus on sequences of API calls, which are widely acknowledged as discriminative and easily extracted features. In this paper, we argue that extending the feature set beyond API calls may improve malware detection performance. To this end, we use the open-source Cuckoo sandbox, carefully configured, to produce a novel dataset for dynamic malware analysis containing 22,200 annotated samples (11,735 benign and 10,465 malware). Each sample is a full-featured report generated by the Cuckoo sandbox when the corresponding binary file is submitted for analysis. To support our position that full-featured sandbox reports have greater discriminative power than API call sequences alone, we consider samples obtained from binary files whose execution induced API calls. From the full-featured dataset, we also derive a second dataset whose samples contain only information on API calls. In a three-way factorial design experiment (varying the feature set, the feature representation technique, and the random forest hyperparameter settings), we trained and tested a set of random forest models on a two-class classification task. The results demonstrate that using full-featured sandbox reports improves malware detection performance: accuracy increased from 95.56 percent for API call sequences to 99.74 percent for full-featured sandbox reports.
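
The derivation of an API-call-only dataset from full-featured reports can be illustrated with a minimal Python sketch. It assumes a Cuckoo-style JSON report in which API calls are nested under `behavior` → `processes` → `calls`, each with an `api` field; that layout, the helper names, and the toy reports are assumptions made for illustration, not the authors' actual pipeline.

```python
from typing import Any, Dict, List


def api_call_sequence(report: Dict[str, Any]) -> List[str]:
    """Collect API call names in report order, process by process.

    Assumes the layout report["behavior"]["processes"][i]["calls"][j]["api"];
    missing keys are treated as "no calls recorded".
    """
    behavior = report.get("behavior") or {}
    sequence: List[str] = []
    for process in behavior.get("processes") or []:
        for call in process.get("calls") or []:
            name = call.get("api")
            if name:
                sequence.append(name)
    return sequence


def induced_api_calls(report: Dict[str, Any]) -> bool:
    """True if executing the submitted binary produced at least one API call."""
    return len(api_call_sequence(report)) > 0


def api_only_view(report: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a full-featured report to a record holding only API call data."""
    return {"api_calls": api_call_sequence(report)}


# Toy reports with hypothetical values (not taken from the dataset).
full_report = {
    "info": {"score": 7.2},
    "network": {"dns": [{"request": "example.com"}]},
    "behavior": {
        "processes": [
            {"pid": 100, "calls": [{"api": "NtCreateFile"}, {"api": "NtWriteFile"}]},
            {"pid": 101, "calls": [{"api": "RegOpenKeyExW"}]},
        ]
    },
}
empty_report = {"info": {"score": 0.0}, "behavior": {"processes": []}}

# Keep only reports whose execution induced API calls, then derive the
# API-only counterpart of each retained report.
kept = [r for r in (full_report, empty_report) if induced_api_calls(r)]
api_only = [api_only_view(r) for r in kept]
print(api_only)
```

Both datasets then cover the same retained samples, so any difference in classifier accuracy can be attributed to the feature set rather than to the sample selection.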

Publisher

MDPI AG

Link

https://www.mdpi.com/2079-9292/13/17/3553/pdf
