On the relative value of imbalanced learning for code smell detection

Author:

Li Fuyang (1), Zou Kuan (1, 2), Keung Jacky Wai (3), Yu Xiao (1, 4, 5), Feng Shuo (6), Xiao Yan (7)

Affiliation:

1. School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China

2. School of Computer, Electronics and Information, Guangxi University, Nanning, China

3. Department of Computer Science, City University of Hong Kong, Hong Kong, China

4. Sanya Science and Education Innovation Park of Wuhan University of Technology, Sanya, China

5. Wuhan University of Technology Chongqing Research Institute, Chongqing, China

6. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China

7. School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China

Abstract

Machine learning-based code smell detection (CSD) has been demonstrated to be a valuable approach for improving software quality and helping developers identify problematic patterns in code. However, previous research has shown that the code smell data sets commonly used to train these models are heavily imbalanced. Although some recent studies have explored imbalanced learning techniques for CSD, they evaluated only a limited number of techniques, so their conclusions about the most effective methods may be biased and inconclusive. To thoroughly evaluate the effect of imbalanced learning techniques on machine learning-based CSD, we examine 31 imbalanced learning techniques with seven classifiers to build CSD models on four code smell data sets. We assess detection performance with four evaluation metrics, using the Wilcoxon signed-rank test and Cliff's δ. The results show that (1) not all imbalanced learning techniques significantly improve detection performance, but deep forest significantly outperforms the other techniques on all code smell data sets; (2) SMOTE (Synthetic Minority Over-sampling TEchnique) is not the most effective technique for resampling code smell data sets; and (3) the best-performing imbalanced learning techniques and the top-3 data resampling techniques add little time cost to code smell detection. We therefore provide some practical guidelines. First, researchers and practitioners should select appropriate imbalanced learning techniques (e.g., deep forest) to ameliorate the class imbalance problem, because the blind application of imbalanced learning techniques could be harmful. Second, data resampling techniques better than SMOTE should be selected to preprocess the code smell data sets.
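As an illustration of the effect-size measure named in the abstract, the following is a minimal sketch of Cliff's delta in plain Python. It is not the authors' implementation: the function names, the sample scores, and the magnitude thresholds (0.147, 0.33, 0.474, a commonly used convention) are assumptions made for this example. The Wilcoxon signed-rank test would normally come from a statistics library and is omitted here.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta: float) -> str:
    """Map |delta| to a label using commonly cited thresholds (an assumption)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Hypothetical per-run scores (not taken from the paper): one list for a
# classifier trained with an imbalanced learning technique, one without.
with_technique = [0.81, 0.79, 0.84, 0.80, 0.83]
baseline = [0.74, 0.78, 0.77, 0.75, 0.80]

delta = cliffs_delta(with_technique, baseline)
label = magnitude(delta)
```

A positive delta means the first sample tends to score higher than the second. In a study of this kind, the effect size is typically reported alongside the significance test so that a statistically significant but negligible difference is not over-interpreted.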

Funder

Sanya Yazhou Bay Science and Technology City

National Natural Science Foundation of China

Natural Science Foundation of Chongqing

Publisher

Wiley

Subject

Software


Cited by 9 articles.

