Nearest‐neighbor, BERT‐based, scalable clone detection: A practical approach for large‐scale industrial code bases

Author:

Ahmed Gul Aftab1,Patten James Vincent2,Han Yuanhua3,Lu Guoxian4,Hou Wei5,Gregg David1,Buckley Jim2,Chochlov Muslim2ORCID

Affiliation:

1. Department of Computer Science Trinity College Dublin Dublin Ireland

2. Department of Computer Science and Information Systems University of Limerick Limerick Ireland

3. WN Digital IPD and Trustworthiness Enabling Huawei Technologies Co., Ltd. Xi'an Shaanxi China

4. WN Digital IPD and Trustworthiness Enabling Huawei Technologies Co., Ltd. Shanghai China

5. Huawei Vulnerability Management Center Huawei Technologies Co., Ltd. Shenzhen Guangdong China

Abstract

AbstractHidden code clones negatively impact software maintenance, but manually detecting them in large codebases is impractical. Additionally, automated approaches find detection of syntactically‐divergent clones very challenging. While recent deep neural networks (for example BERT‐based artificial neural networks) seem more effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We present SSCD, a BERT‐based clone detection approach that targets high recall of Type 3 and Type 4 clones at a very large scale (in line with our industrial partner's requirements). It computes a representative embedding for each code fragment and finds similar fragments using a nearest neighbor search. Thus, SSCD avoids the pairwise‐comparison bottleneck of other neural network approaches, while also using a parallel, GPU‐accelerated search to tackle scalability. This article describes the approach, proposing and evaluating several refinements to improve Type 3/4 clone detection at scale. It provides a substantial empirical evaluation of the technique, including a speed/efficacy comparison of the approach against SourcererCC and Oreo, the only other neural‐network approach currently capable of scaling to hundreds of millions of LOC. It also includes a large in‐situ evaluation on our industrial collaborator's code base that assesses the original technique, the impact of the proposed refinements and illustrates the impact of incremental, active learning on its efficacy. We find that SSCD is significantly faster and more accurate than SourcererCC and Oreo. SAGA, a GPU‐accelerated traditional clone detection approach, is a little better than SSCD for T1/T2 clones, but substantially worse for T3/T4 clones. Thus, SSCD is both scalable to industrial code sizes, and comparatively more accurate than existing approaches for difficult T3/T4 clone searching. In‐situ evaluation on company datasets shows that SSCD outperforms the baseline approach (CCFinderX) for T3/T4 clones. Whitespace removal and active learning further improve SSCD effectiveness.

Funder

Science Foundation Ireland

Publisher

Wiley

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3