Assessing effectiveness of test suites: what do we know and what should we do?

Published:2023-12-05
ISSN:1049-331X
Container-title:ACM Transactions on Software Engineering and Methodology
language:en
Short-container-title:ACM Trans. Softw. Eng. Methodol.

Author:

Zhang Peng¹,Wang Yang²,Liu Xutong²,Lu Zeyu²,Yang Yibiao²,Li Yanhui²,Chen Lin²,Wang Ziyuan³,Sun Chang-ai⁴,Yu Xiao⁵,Zhou Yuming²

Affiliation:

1. State Key Laboratory for Novel Software Technology, Nanjing University, China and Huawei Technologies Co., Ltd, China

2. State Key Laboratory for Novel Software Technology, Nanjing University, China

3. Nanjing University of Posts and Telecommunications, China

4. University of Science and Technology Beijing, China

5. Huawei Technologies Co., Ltd, China

Abstract

Background. Software testing is a critical activity for ensuring the quality and reliability of software systems. To evaluate the effectiveness of different test suites, researchers have developed a variety of metrics.

Problem. However, comparing these metrics is challenging because there is no standardized evaluation framework that accounts for a comprehensive set of factors. As a result, researchers often focus on single factors (e.g., size), which leads to different or even contradictory conclusions. After comparing dozens of studies in detail, we have found two problems that most trouble our community: (1) researchers tend to oversimplify the description of the ground truth they use, and (2) data involving real defects is not suitable for analysis using traditional statistical indicators.

Objective. We aim to scrutinize the whole process of comparing test suites for our community.

Method. To this end, we propose a framework, ASSENT (evAluating teSt Suite EffectiveNess meTrics), to guide follow-up research on evaluating a test suite effectiveness metric. ASSENT consists of three fundamental components: ground truth, benchmark test suites, and agreement indicator. It works as follows. First, users clarify the ground truth for determining the real order in effectiveness among test suites. Second, users generate a set of benchmark test suites and derive their ground truth order in effectiveness. Third, users use the metric to derive the order in effectiveness for the same test suites. Finally, users calculate the agreement indicator between the two orders, i.e., the ground truth order and the order derived by the metric.

Result. With ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score and code coverage metrics. Our results show that, based on real faults, mutation score and subsuming mutation score are the best metrics to quantify test suite effectiveness. Meanwhile, when mutants are used instead of real faults, test effectiveness is overestimated by more than 20%.

Conclusion. We recommend that the standardized evaluation framework ASSENT be used for evaluating and comparing test effectiveness metrics in future work.
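The four-step workflow in the abstract (fix a ground truth, order the benchmark test suites by it, order them by the metric, compare the two orders) can be illustrated with a minimal Python sketch. The abstract does not specify ASSENT's agreement indicator, so the simple pairwise-concordance ratio used here is an assumption chosen for illustration, not the paper's indicator. The function name `pairwise_agreement`, the suite names, and all numbers are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(ground_truth, metric):
    """Fraction of test-suite pairs that the metric orders the same way
    as the ground truth.

    Pairs tied in the ground truth are skipped; pairs tied by the metric
    count as disagreement. This is an illustrative stand-in, NOT the
    agreement indicator defined by ASSENT.
    """
    agree = 0
    total = 0
    for a, b in combinations(ground_truth, 2):
        gt_diff = ground_truth[a] - ground_truth[b]
        if gt_diff == 0:
            continue  # ground truth gives no order for this pair
        total += 1
        metric_diff = metric[a] - metric[b]
        if gt_diff * metric_diff > 0:
            agree += 1  # both orders rank the pair the same way
    return agree / total if total else float("nan")


# Hypothetical benchmark: ground truth = number of real faults detected.
ground_truth = {"T1": 3, "T2": 1, "T3": 2, "T4": 0}
# Hypothetical metric values for the same four test suites.
mutation_score = {"T1": 0.9, "T2": 0.4, "T3": 0.6, "T4": 0.1}
statement_coverage = {"T1": 0.7, "T2": 0.8, "T3": 0.5, "T4": 0.2}

print(pairwise_agreement(ground_truth, mutation_score))
print(pairwise_agreement(ground_truth, statement_coverage))
```

In this made-up data the mutation score reproduces the ground truth order on all six pairs, while statement coverage misorders two of them, so the first metric would be judged the more accurate one under this stand-in indicator.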

Publisher

Association for Computing Machinery (ACM)

Subject

Software

Link

https://dl.acm.org/doi/pdf/10.1145/3635713

Reference: 48 articles.

1. [n. d.]. https://github.com/zhangpengNJU/ASSENT/README.md

2. [n. d.]. https://github.com/cobertura

3. Can testedness be effectively measured?

4. Paul Ammann, Marcio Eduardo Delamaro, and Jeff Offutt. 2014. Establishing theoretical minimal sets of mutants. In 2014 IEEE seventh international conference on software testing, verification and validation. IEEE, 21–30.

5. Thomas Ball. 2004. A theory of predicate-complete test coverage and generation. In International Symposium on Formal Methods for Components and Objects. Springer, 1–22.