Authors:
Jedrzej Rybicki, Tatiana Frenklach, Rami Puzis
Abstract
Sample compression using an 𝜖-net effectively reduces the number of labeled instances required for accurate classification with nearest neighbor algorithms. However, one-shot construction of an 𝜖-net can be extremely challenging in large-scale distributed data sets. We explore two approaches for distributed sample compression: one where a local 𝜖-net is constructed for each data partition and the local nets are merged during an aggregation phase, and one where a single 𝜖-net backbone is constructed from one partition and aggregates target label distributions from the other partitions. Both approaches are applied to the problem of malware detection in a complex, real-world data set of Android apps using the nearest neighbor algorithm. Examination of the compression rate, computational efficiency, and predictive power shows that a single 𝜖-net backbone attains favorable performance while achieving a compression rate of 99%.
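The core idea behind 𝜖-net sample compression can be illustrated with a greedy construction: a point joins the net only if it lies farther than 𝜖 from every point already kept, and classification then uses nearest-neighbor lookup against the net alone. The sketch below is a minimal illustration under these assumptions; the function names, the toy data, and the greedy strategy are illustrative and are not taken from the paper's implementation.

```python
import math

def euclidean(a, b):
    # Plain Euclidean distance between two coordinate tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_eps_net(points, labels, eps):
    # Greedy epsilon-net: keep a point only if it is farther than
    # eps from every point already in the net, so every discarded
    # point has a representative within eps.
    net_pts, net_lbls = [], []
    for p, y in zip(points, labels):
        if all(euclidean(p, q) > eps for q in net_pts):
            net_pts.append(p)
            net_lbls.append(y)
    return net_pts, net_lbls

def nn_classify(query, net_pts, net_lbls):
    # 1-NN prediction using only the compressed net.
    i = min(range(len(net_pts)), key=lambda j: euclidean(query, net_pts[j]))
    return net_lbls[i]

# Toy data: two tight clusters with heavy redundancy.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
lbls = ["benign", "benign", "benign", "malware", "malware"]

net_pts, net_lbls = build_eps_net(pts, lbls, eps=0.5)
print(len(net_pts))                        # 2 representatives remain
print(nn_classify((0.2, 0.2), net_pts, net_lbls))  # "benign"
```

In the distributed setting described above, this routine could either run independently on each partition with the resulting nets merged afterwards, or run once on a single partition to form the backbone against which the other partitions' label distributions are aggregated.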
Funder
Helmholtz-Gemeinschaft
Forschungszentrum Jülich GmbH
Publisher
Springer Science and Business Media LLC