Improving Large-Gap Clone Detection Recall Using Multiple Features

Published:2022-07 Issue:07 Volume:32 Page:1071-1099
ISSN:0218-1940
Container-title:International Journal of Software Engineering and Knowledge Engineering
language:en
Short-container-title:Int. J. Soft. Eng. Knowl. Eng.

Author:

Dai Peng¹, Zhang Qianjin², Wang Yawen¹,², Jin Dahai², Gong Yunzhan²

Affiliation:

1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing Shi, Haidian Qu, 100876, P. R. China

2. Guangxi Key Laboratory of Cryptography and Information Security, Guangxi, Guilin, 541004, P. R. China

Abstract

Code clones are two or more identical or similar source code fragments. Research on code clone detection has continued for decades. Evaluations of existing clone detection techniques indicate that they perform well at function-level clone detection, but there is still room for further research on block-level clone detection. In particular, type-3 clones that include large gaps remain an ongoing challenge. To address these problems, we propose a clone detection method based on multiple code features. It aims to improve the recall of code block clone detection and to handle large-gap, hard-to-detect type-3 clones. The method first splits the source code files into code blocks based on the program’s structural and context features. The collection of code blocks obtained in this way is complete, and the large gaps in clone pairs are removed. In addition, we only need to compute the similarity between code blocks with the same structural features, which significantly saves time and resources. The similarity is obtained by calculating the proportion of tokens shared by two code blocks. Moreover, since different types of tokens carry different weights in the similarity calculation, we use supervised learning to obtain a classifier model relating token features to code clones. We divide the tokens into 13 types and train the machine learning model on manually confirmed clone and non-clone pairs. Finally, we develop a prototype system and compare our tool with existing tools under the Mutation Framework and on several real-world C projects. The experimental results demonstrate the advancement and practicality of our prototype.
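
The token-proportion similarity described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the four token categories, the regex tokenizer, the hand-set weights, and the choice to normalize by the larger block are all assumptions made for illustration. The paper uses 13 token types and learns their influence with a supervised classifier, neither of which is detailed in the abstract.

```python
import re
from collections import Counter

# Hypothetical, simplified token categories (the paper's 13 types are not
# listed in the abstract, so these four are stand-ins).
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void", "break"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def token_type(tok):
    """Map a token to one of the illustrative categories."""
    if tok in C_KEYWORDS:
        return "keyword"
    if tok[0].isalpha() or tok[0] == "_":
        return "identifier"
    if tok[0].isdigit():
        return "literal"
    return "punct"


def tokenize(code):
    """Return a multiset (Counter) of the tokens in a code block."""
    return Counter(TOKEN_RE.findall(code))


def weighted_similarity(block_a, block_b, weights):
    """Weighted share of tokens common to both blocks.

    Assumption: the score is normalized by the heavier of the two blocks;
    the abstract does not say which denominator the authors use.
    """
    a, b = tokenize(block_a), tokenize(block_b)
    shared = a & b  # multiset intersection: min count per token

    def mass(counter):
        return sum(weights[token_type(t)] * n for t, n in counter.items())

    denom = max(mass(a), mass(b))
    return mass(shared) / denom if denom else 0.0
```

With uniform weights of 1.0, the blocks `int x = 0;` and `int y = 0;` share four of five tokens and score 0.8. Setting the identifier weight to 0.0 raises the score to 1.0, which shows why per-type weights matter for clones that differ only in renamed variables.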

Funder

Innovative Research Group Project of the National Natural Science Foundation of China

Guangxi Key Laboratory of Cryptography and Information Security, China

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence,Computer Graphics and Computer-Aided Design,Computer Networks and Communications,Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218194022500413

Reference: 48 articles.

1. An Empirical Study of the Impacts of Clones in Software Maintenance

2. On finding duplication and near-duplication in large software systems

3. Clone detection using abstract syntax trees

4. CP-Miner: finding copy-paste and related bugs in large-scale software code

5. Stack Overflow: A code laundering platform?

Cited by 1 article.

1. Deep Neural Network Optimization Based on Binary Method for Handling Multi-Class Problems; IEEE Access; 2024