Predicting unstable software benchmarks using static source code features

Authors:

Christoph Laaber, Mikael Basmaci, Pasquale Salza

Abstract

Software benchmarks are only as good as the performance measurements they yield. Unstable benchmarks show high variability among repeated measurements, which causes uncertainty about the actual performance and complicates reliable change assessment. However, whether a benchmark is stable or unstable only becomes evident after it has been executed and its results are available. In this paper, we introduce a machine-learning-based approach to predict a benchmark’s stability without having to execute it. Our approach relies on 58 statically computed source code features, extracted for benchmark code and code called by a benchmark, related to (1) meta-information, e.g., lines of code (LOC), (2) programming language elements, e.g., conditionals or loops, and (3) potentially performance-impacting standard library calls, e.g., file and network input/output (I/O). To assess our approach’s effectiveness, we perform a large-scale experiment on 4,461 Go benchmarks from 230 open-source software (OSS) projects. First, we assess the prediction performance of our machine learning models using 11 binary classification algorithms. We find that Random Forest performs best, with prediction performance ranging from 0.79 to 0.90 in terms of AUC and from 0.43 to 0.68 in terms of MCC. Second, we perform feature importance analyses for individual features and feature categories. We find that 7 features related to meta-information, slice usage, nested loops, and synchronization application programming interfaces (APIs) are individually important for good predictions; and that the combination of all features of the called source code is paramount for our model, while the combination of features of the benchmark itself is less important. Our results show that although benchmark stability is affected by more than just the source code, we can effectively utilize machine learning models to predict whether a benchmark will be stable or not ahead of execution. This enables spending precious testing time on reliable benchmarks, supporting developers in identifying unstable benchmarks during development, allowing unstable benchmarks to be repeated more often, estimating stability in scenarios where repeated benchmark execution is infeasible or impossible, and warning developers if new benchmarks, or existing benchmarks executed in new environments, will be unstable.
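To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation or data, of training a Random Forest to classify benchmarks as stable or unstable from statically computed features and evaluating it with AUC and MCC, using scikit-learn. The feature matrix, labels, and thresholds below are synthetic placeholders.

```python
# Minimal sketch (assumed setup, not the paper's actual tooling):
# binary classification of benchmark stability from static source code features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder feature matrix: rows = benchmarks, columns = static features
# such as LOC, loop/conditional counts, slice usage, sync and I/O API calls.
n_benchmarks, n_features = 4461, 58
X = rng.poisson(lam=3.0, size=(n_benchmarks, n_features)).astype(float)

# Placeholder labels: 1 = unstable benchmark, 0 = stable benchmark.
y = (X[:, 0] + X[:, 1] + rng.normal(size=n_benchmarks) > 7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of "unstable"
pred = clf.predict(X_test)

print(f"AUC: {roc_auc_score(y_test, proba):.2f}")
print(f"MCC: {matthews_corrcoef(y_test, pred):.2f}")

# Per-feature importances, analogous to the paper's feature importance analysis.
top = np.argsort(clf.feature_importances_)[::-1][:7]
print("Most important feature indices:", top)
```

In the paper's setting, X would hold the 58 static features extracted from the benchmark and its called code, and y would label each benchmark as stable or unstable based on the variability of its measured results.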

Funder

Universität Zürich

Publisher

Springer Science and Business Media LLC

Subject

Software


Cited by 17 articles (first 5 listed):

1. Evaluating Search-Based Software Microbenchmark Prioritization. IEEE Transactions on Software Engineering, July 2024.

2. Enhancing Performance Bug Prediction Using Performance Code Metrics. Proceedings of the 21st International Conference on Mining Software Repositories, April 15, 2024.

3. A longitudinal study on the temporal validity of software samples. Information and Software Technology, April 2024.

4. What makes a real change in software performance? An empirical study on analyzing the factors that affect the triagement of performance change points. Science of Computer Programming, March 2024.

5. Studying the association between Gitcoin’s issues and resolving outcomes. Journal of Systems and Software, December 2023.
