Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

Author:

Eren Maksim E. (1), Bhattarai Manish (2), Joyce Robert J. (3), Raff Edward (3), Nicholas Charles (4), Alexandrov Boian S. (2)

Affiliation:

1. Advanced Research in Cyber Systems, Los Alamos National Laboratory, USA

2. Theoretical Division, Los Alamos National Laboratory, USA

3. Machine Learning Research Group, Booz Allen Hamilton, USA

4. Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, USA

Abstract

Identifying the family to which a malware specimen belongs is essential for understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable because their evaluations lack realistic factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this article, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With the HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can abstain from a prediction (the reject option), which yields promising results in the identification of novel malware families and helps maintain the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 malware families, both rare and prominent, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.
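The combination of a semi-supervised setup with abstaining predictions can be illustrated with a small sketch. This is not the HNMFk Classifier itself. It assumes that some earlier step (for example, a non-negative matrix factorization whose coefficient matrix is used to assign samples to clusters) has already produced one cluster id per sample, and it only shows how known labels might be spread within clusters while rejecting samples whose cluster offers no reliable label. The function name `propagate_labels`, the `min_purity` threshold, and the family names are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict

ABSTAIN = None  # hypothetical marker meaning "no prediction made"


def propagate_labels(cluster_ids, known_labels, min_purity=0.5):
    """Give each sample the majority known label of its cluster, or abstain.

    cluster_ids:  one cluster id per sample, assumed to come from a prior
                  factorization/clustering step that is not shown here.
    known_labels: one family label per sample, or None when unlabeled.
    min_purity:   hypothetical threshold; a cluster is left unlabeled when
                  its majority label covers less than this fraction of the
                  labeled samples in that cluster.
    """
    # Count the known labels observed inside each cluster.
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, known_labels):
        if label is not None:
            votes[cid][label] += 1

    # Keep a cluster label only when the majority is strong enough.
    cluster_label = {}
    for cid, counter in votes.items():
        label, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_purity:
            cluster_label[cid] = label

    # Clusters without a usable label lead to an abstaining prediction.
    return [cluster_label.get(cid, ABSTAIN) for cid in cluster_ids]


# Toy data: cluster 2 has no labeled samples, so its members are rejected.
cluster_ids = [0, 0, 0, 1, 1, 2, 2]
known_labels = ["family_a", None, "family_a", "family_b", None, None, None]
predictions = propagate_labels(cluster_ids, known_labels)
print(predictions)
```

In this toy run the two samples in cluster 2 receive no prediction, which mirrors the idea that a group containing no known family may correspond to a novel one and should be deferred rather than forced into an existing class.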

Funder

Los Alamos National Laboratory (LANL) Laboratory Directed Research and Development

LANL Institutional Computing Program

U.S. Department of Energy National Nuclear Security Administration

Publisher

Association for Computing Machinery (ACM)

Subject

Safety, Risk, Reliability and Quality; General Computer Science

References: 74 articles.

Cited by 1 article.

1. Electrical Grid Anomaly Detection via Tensor Decomposition; MILCOM 2023 - 2023 IEEE Military Communications Conference (MILCOM); 2023-10-30
