Authors:
Song Qihong, Hu Haize, Dai Tebo
Abstract
Code search aims to retrieve, from a large codebase, code snippets that are semantically related to natural-language queries. Deep learning is a valuable approach to code search, and the quality of the training data directly affects model performance. However, most existing deep-learning models for code search overlook the critical role that the training data within each batch, particularly hard negative samples, plays in optimizing model parameters. In this paper, we propose CMCS, a contrastive-metric learning method for code search based on vector-level sampling and augmentation. Specifically, we propose a sampling method that obtains hard negative samples using the K-means algorithm, and a hardness-controllable augmentation method that obtains positive and hard negative samples using vector-level augmentation techniques. We then design an optimization objective that combines metric learning and multimodal contrastive learning over the resulting positive and hard negative samples. Extensive experiments were conducted on the large-scale CodeSearchNet dataset using seven advanced code search models. The results show that the proposed method significantly improves the training efficiency and search performance of code search models, which can benefit software engineering development.
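The abstract names two vector-level steps, K-means-based hard negative sampling and hardness-controllable augmentation, without giving their details. The sketch below is only one plausible reading, not the authors' procedure. It assumes that code snippets falling in the same K-means cluster as the positive snippet serve as hard negatives, and that hardness is controlled by interpolating a negative vector toward the positive one. The function names (`kmeans`, `hard_negatives`, `mix_negative`), the `hardness` parameter, and the toy 2-D vectors are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def hard_negatives(labels, index):
    """Assumption: other snippets in the same cluster are hard negatives."""
    return [j for j, lab in enumerate(labels)
            if lab == labels[index] and j != index]


def mix_negative(positive, negative, hardness):
    """Assumption: pull a negative toward the positive; hardness in [0, 1)."""
    return [(1 - hardness) * n + hardness * p
            for p, n in zip(positive, negative)]


# Toy code-snippet embeddings forming two well-separated groups.
code_vectors = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.2],
]
labels = kmeans(code_vectors, k=2)
negatives_for_0 = hard_negatives(labels, 0)
harder = mix_negative(code_vectors[0], code_vectors[1], hardness=0.5)
```

Under these assumptions, a larger `hardness` value yields a synthetic negative that lies closer to the positive and is therefore harder to separate. How CMCS actually defines and schedules hardness is not stated in the abstract.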
Publisher
Springer Science and Business Media LLC