Assessing effectiveness of test suites: what do we know and what should we do?

Author:

Zhang Peng (1), Wang Yang (2), Liu Xutong (2), Lu Zeyu (2), Yang Yibiao (2), Li Yanhui (2), Chen Lin (2), Wang Ziyuan (3), Sun Chang-ai (4), Yu Xiao (5), Zhou Yuming (2)

Affiliation:

1. State Key Laboratory for Novel Software Technology, Nanjing University, China and Huawei Technologies Co., Ltd, China

2. State Key Laboratory for Novel Software Technology, Nanjing University, China

3. Nanjing University of Posts and Telecommunications, China

4. University of Science and Technology Beijing, China

5. Huawei Technologies Co., Ltd, China

Abstract

Background. Software testing is a critical activity for ensuring the quality and reliability of software systems. To evaluate the effectiveness of different test suites, researchers have developed a variety of metrics. Problem. However, comparing these metrics is challenging because there is no standardized evaluation framework that covers the relevant factors comprehensively. As a result, researchers often focus on a single factor (e.g., size), which leads to differing or even contradictory conclusions. After comparing dozens of studies in detail, we found two problems that most trouble our community: (1) researchers tend to oversimplify the description of the ground truth they use, and (2) data involving real defects is not suitable for analysis with traditional statistical indicators. Objective. We aim to scrutinize the whole process of comparing test suites for our community. Method. To this end, we propose a framework, ASSENT (evAluating teSt Suite EffectiveNess meTrics), to guide follow-up research on evaluating a test suite effectiveness metric. ASSENT consists of three fundamental components: ground truth, benchmark test suites, and agreement indicator. It works as follows: first, users clarify the ground truth for determining the real order in effectiveness among test suites. Second, users generate a set of benchmark test suites and derive their ground-truth order in effectiveness. Third, users use the metric under evaluation to derive the order in effectiveness for the same test suites. Finally, users calculate the agreement indicator between the two orders (the ground-truth order and the metric-derived order). Result. With ASSENT, we are able to compare the accuracy of different test suite effectiveness metrics. We apply ASSENT to evaluate representative test suite effectiveness metrics, including mutation score and code coverage metrics. Our results show that, based on real faults, mutation score and subsuming mutation score are the best metrics for quantifying test suite effectiveness. Meanwhile, when mutants are used instead of real faults, test effectiveness values are overestimated by more than 20%. Conclusion. We recommend using the standardized evaluation framework ASSENT to evaluate and compare test effectiveness metrics in future work.
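
The four steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify which agreement indicator ASSENT uses, so the pairwise concordance fraction below is only an assumed stand-in, not the paper's indicator. The function name `rank_agreement`, the suite names, and all scores are hypothetical; the ground truth is assumed here to be the number of real faults each suite detects.

```python
from itertools import combinations


def rank_agreement(truth_scores, metric_scores):
    """Fraction of test-suite pairs that the metric orders the same way
    as the ground truth. Pairs tied in the ground truth are skipped.

    NOTE: an assumed stand-in for an agreement indicator, not ASSENT's own.
    """
    concordant = 0
    comparable = 0
    for a, b in combinations(sorted(truth_scores), 2):
        truth_diff = truth_scores[a] - truth_scores[b]
        metric_diff = metric_scores[a] - metric_scores[b]
        if truth_diff == 0:
            continue  # the ground truth does not order this pair
        comparable += 1
        if truth_diff * metric_diff > 0:
            concordant += 1  # same direction in both orders
    return concordant / comparable if comparable else float("nan")


# Steps 1 and 2: hypothetical ground truth (real faults detected per suite).
ground_truth = {"T1": 3, "T2": 1, "T3": 2, "T4": 3}

# Step 3: hypothetical values of two metrics on the same benchmark suites.
mutation_score = {"T1": 0.9, "T2": 0.4, "T3": 0.6, "T4": 0.7}
statement_coverage = {"T1": 0.8, "T2": 0.85, "T3": 0.7, "T4": 0.9}

# Step 4: agreement between the ground-truth order and each metric's order.
print(rank_agreement(ground_truth, mutation_score))      # 1.0
print(rank_agreement(ground_truth, statement_coverage))  # 0.6
```

In this made-up data the mutation score orders every comparable pair as the ground truth does, while statement coverage agrees on three of five pairs. A higher agreement value means the metric ranks test suites more like the ground truth.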

Publisher

Association for Computing Machinery (ACM)

Subject

Software

