A large-scale longitudinal study of flaky tests

Author:

Wing Lam¹, Stefan Winter², Anjiang Wei³, Tao Xie³ (ORCID), Darko Marinov¹, Jonathan Bell⁴

Affiliation:

1. University of Illinois at Urbana-Champaign, USA

2. TU Darmstadt, Germany

3. Peking University, China

4. Northeastern University, USA

Abstract

Flaky tests are tests that can non-deterministically pass or fail for the same code version. These tests undermine regression testing efficiency, because developers cannot easily identify whether a test fails due to their recent changes or due to flakiness. Ideally, one would detect flaky tests right when flakiness is introduced, so that developers can immediately remove the flakiness. Some software organizations, e.g., Mozilla and Netflix, run tools—detectors—to detect flaky tests as soon as possible. However, detecting flaky tests is costly due to their inherent non-determinism, so even state-of-the-art detectors are often impractical to run on all tests for every project change. To combat the high cost of applying detectors, these organizations typically run a detector solely on newly added or directly modified tests, i.e., not on unmodified tests or when other changes occur (including changes to the test suite, the code under test, and library dependencies). However, it is unclear how many flaky tests are detected or missed by applying detectors in only these limited circumstances. To better understand this problem, we conduct a large-scale longitudinal study of flaky tests to determine when flaky tests become flaky and what changes cause them to become flaky. We apply two state-of-the-art detectors to 55 Java projects, identifying a total of 245 flaky tests that can be compiled and run in the code version where each test was added. We find that 75% of flaky tests (184 out of 245) are flaky when added, indicating substantial potential value for developers to run detectors specifically on newly added tests. However, running detectors solely on newly added tests would still miss detecting 25% of flaky tests. The percentage of flaky tests that can be detected does increase to 85% when detectors are run on newly added or directly modified tests. The remaining 15% of flaky tests become flaky due to other changes and can be detected only when detectors are always applied to all tests. Our study is the first to empirically evaluate when tests become flaky and to recommend guidelines for applying detectors in the future.
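To make the abstract's notion of non-deterministic passing and failing concrete, below is a minimal, hypothetical Java sketch of one common flakiness pattern: an assertion that races against asynchronous work. The class and method names (`FlakyExample`, `flakyCheck`, `robustCheck`) are our own illustration and are not taken from the study's subject projects; the sketch only assumes the standard `java.util.concurrent` API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class FlakyExample {
    // Simulated asynchronous operation that completes after a variable delay.
    static CompletableFuture<Integer> computeAsync() {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep((long) (Math.random() * 20)); // 0-20 ms of "work"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return 42;
        });
    }

    // Flaky pattern: a fixed sleep races against the async work, so this
    // check non-deterministically returns true or false for the same code.
    static boolean flakyCheck() {
        CompletableFuture<Integer> f = computeAsync();
        try {
            Thread.sleep(10); // may or may not be long enough
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return f.isDone();
    }

    // Robust pattern: block on the result with an explicit timeout instead
    // of guessing a sleep duration; deterministic for the same code version.
    static int robustCheck() {
        try {
            return computeAsync().get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("robust result: " + robustCheck());
    }
}
```

A detector run only when `flakyCheck`'s test is added would catch this case; the study's point is that tests can also *become* flaky later, e.g., if the simulated delay grows after a change to the code under test, which only repeated runs on all tests would reveal.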

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Safety, Risk, Reliability and Quality; Software


Cited by 39 articles.

1. Non-Flaky and Nearly-Optimal Time-based Treatment of Asynchronous Wait Web Tests. ACM Transactions on Software Engineering and Methodology, 2024-09-13.

2. Neurosymbolic Repair of Test Flakiness. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024-09-11.

3. Cost of Flaky Tests in Continuous Integration: An Industrial Case Study. 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), 2024-05-27.

4. Regression-Test History Data for Flaky-Test Research. Proceedings of the 1st International Workshop on Flaky Tests, 2024-04-14.

5. Can ChatGPT Repair Non-Order-Dependent Flaky Tests? Proceedings of the 1st International Workshop on Flaky Tests, 2024-04-14.
