MalBERTv2: Code Aware BERT-Based Model for Malware Identification

Published:2023-03-24 Issue:2 Volume:7 Page:60
ISSN:2504-2289
Container-title:Big Data and Cognitive Computing
language:en
Short-container-title:BDCC

Author:

Rahali Abir¹, Akhloufi Moulay A.¹

Affiliation:

1. Perception, Robotics, and Intelligent Machines (PRIME), Department of Computer Science, Université de Moncton, Moncton, NB E1A 3E9, Canada

Abstract

To proactively mitigate malware threats, cybersecurity tools, such as anti-virus and anti-malware software, as well as firewalls, require frequent updates and proactive implementation. However, processing the vast number of dataset examples can be overwhelming when relying solely on traditional methods. In cybersecurity workflows, recent advances in natural language processing (NLP) models can aid in proactively detecting various threats. In this paper, we present a novel approach for representing the relevance and significance of the Malware/Goodware (MG) datasets through the use of a pre-trained language model called MalBERTv2. Our model is trained on publicly available datasets, with a focus on the source code of the apps, by extracting the top-ranked files that present the most relevant information. These files are then passed through a pre-tokenization feature generator, and the resulting keywords are used to train the tokenizer from scratch. Finally, we apply a classifier using bidirectional encoder representations from transformers (BERT) as a layer within the model pipeline. The performance of our model is evaluated on different datasets, achieving a weighted F1 score ranging from 82% to 99%. Our results demonstrate the effectiveness of our approach for proactively detecting malware threats using NLP techniques.
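The abstract reports a weighted F1 score. As a reading aid, the following minimal Python sketch shows how that metric is commonly defined: the F1 score of each class is averaged with weights proportional to the number of true samples in that class. The function name `weighted_f1` and the 0/1 label encoding (1 for malware, 0 for goodware) are assumptions made for illustration; this is not the authors' evaluation code, and the sample labels are invented.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and predicted positives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Illustrative labels only (hypothetical): 1 = malware, 0 = goodware.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class support, a classifier that performs well on the majority class can obtain a high weighted F1 even when the minority class is handled poorly, which is worth keeping in mind when reading a range such as 82% to 99%.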

Funder

Natural Sciences and Engineering Research Council of Canada

Publisher

MDPI AG

Subject

Artificial Intelligence, Computer Science Applications, Information Systems, Management Information Systems

Link

https://www.mdpi.com/2504-2289/7/2/60/pdf

Reference: 68 articles.

1. A comparison of static, dynamic, and hybrid analysis for malware detection;Damodaran;J. Comput. Virol. Hacking Tech.,2017

2. Application of deep learning to cybersecurity: A survey;Mahdavifar;Neurocomputing,2019

3. Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., Someki, M., Soplin, N.E.Y., Yamamoto, R., and Wang, X. (2019, January 14–18). A comparative study on transformer vs rnn in speech applications. Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore.

4. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017, January 4–9). Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA. Available online: https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

5. Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv.

Cited by 9 articles.

1. Modified CNN with Transfer Learning for Multi-Document Summarization: Proposing Co-Occurrence Matrix Generation-Based Knowledge Extraction;International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems;2024-07

2. A Review of Advancements and Applications of Pre-Trained Language Models in Cybersecurity;2024 12th International Symposium on Digital Forensics and Security (ISDFS);2024-04-29

3. ChatGPT’s applications in marketing: a topic modeling approach;Marketing Intelligence & Planning;2024-03-26

4. Anticipating Threats through Malware Detection Approaches to safeguard Data Privacy and Security: An In-Depth Study;2024 3rd International Conference for Innovation in Technology (INOCON);2024-03-01

5. A Survey of Recent Advances in Deep Learning Models for Detecting Malware in Desktop and Mobile Platforms;ACM Computing Surveys;2024-01-22