CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words

Published: 2022-01-31 Issue: 1 Volume: 31 Pages: 1-37
ISSN: 1049-331X
Container-title: ACM Transactions on Software Engineering and Methodology
Language: en
Short-container-title: ACM Trans. Softw. Eng. Methodol.

Author:

Liu Chao¹, Xia Xin², Lo David³, Liu Zhiwei⁴, Hassan Ahmed E.⁵, Li Shanping¹

Affiliation:

1. Zhejiang University, Hangzhou, Zhejiang, China

2. Huawei, Hangzhou, Zhejiang, China

3. Singapore Management University, Singapore

4. Baidu Inc., Shanghai, China

5. Queen's University, Kingston, Ontario, Canada

Abstract

To accelerate software development, developers frequently search for and reuse existing code snippets from large-scale codebases such as GitHub. Over the years, researchers have proposed many information retrieval (IR)-based models for code search, but these models fail to bridge the semantic gap between query and code. DeepCS, an early successful deep learning (DL)-based model, addressed this issue by learning the relationship between pairs of code methods and their corresponding natural language descriptions. Two major advantages of DeepCS are its ability to understand irrelevant/noisy keywords and to capture sequential relationships between words in query and code. In this article, we propose CodeMatcher, an IR-based model that inherits the advantages of DeepCS (i.e., the capability of understanding the sequential semantics in important query words) while leveraging the indexing technique of IR-based models to substantially accelerate search response time. CodeMatcher first collects metadata for query words to identify irrelevant/noisy ones, then iteratively performs fuzzy search with the important query words on a codebase indexed by Elasticsearch, and finally reranks the returned candidate code according to how the tokens in each candidate snippet sequentially match the important words in the query. We verified its effectiveness on a large-scale codebase with ~41K repositories. Experimental results show that CodeMatcher achieves an MRR (a widely used accuracy measure for code search) of 0.60, outperforming DeepCS, CodeHow, and UNIF by 82%, 62%, and 46%, respectively. Our proposed model is over 1.2K times faster than DeepCS. Moreover, CodeMatcher outperforms two existing online search engines (GitHub and Google search) by 46% and 33%, respectively, in terms of MRR.
We also observed that fusing the advantages of IR-based and DL-based models is promising, and that improving the quality of method naming helps code search, since the method name plays an important role in connecting query and code.
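The reranking step and the MRR measure mentioned in the abstract can be illustrated with a small sketch. This is not the scoring formula from the paper: the greedy in-order score, the identifier splitter, and all function names below are hypothetical and chosen only for illustration. The MRR function follows the standard definition (the mean, over queries, of the reciprocal rank of the first relevant result).

```python
import re


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase tokens."""
    return [part.lower() for part in re.findall(r"[A-Za-z][a-z]*|\d+", name)]


def in_order_match_score(important_words, code_tokens):
    """Fraction of important query words found in code_tokens in query order.

    Hypothetical score (not the paper's): a greedy left-to-right scan that
    looks for each word at or after the position of the previous match.
    """
    if not important_words:
        return 0.0
    position = 0
    matched = 0
    for word in important_words:
        try:
            position = code_tokens.index(word, position) + 1
            matched += 1
        except ValueError:
            continue  # word missing after the last match; skip it
    return matched / len(important_words)


def rerank(important_words, method_names):
    """Order candidate method names by descending in-order match score."""
    return sorted(
        method_names,
        key=lambda name: in_order_match_score(
            important_words, split_identifier(name)
        ),
        reverse=True,
    )


def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries.

    Each entry is the 1-based rank of the first relevant result for one
    query, or None when no relevant result was returned (contributes 0).
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank)
    return total / len(first_relevant_ranks)


if __name__ == "__main__":
    query_words = ["read", "file", "string"]  # assumed already filtered
    candidates = ["stringToFile", "readFileToString", "closeStream"]
    print(rerank(query_words, candidates))
    print(mean_reciprocal_rank([1, 2, None]))
```

In this toy setup the candidate whose name contains all important words in query order is ranked first, which mirrors the abstract's point that method names play an important role in connecting query and code.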

Publisher

Association for Computing Machinery (ACM)

Subject

Software

Link

https://dl.acm.org/doi/pdf/10.1145/3465403

References: 73 articles.

1. Analyzing and mining a code search engine usage log

2. Enriching Word Vectors with Subword Information

Cited by 7 articles.

1. An Empirical Study of Code Search in Intelligent Coding Assistant: Perceptions, Expectations, and Directions;Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering;2024-07-10

2. Performance of Traditional and Dense Vector Information Retrieval Models in Code Search;2024 2nd International Conference on Software Engineering and Information Technology (ICoSEIT);2024-02-28

3. FuEPRe: a fusing embedding method with attention for post recommendation;Service Oriented Computing and Applications;2024-02-23

4. Feature Location Using Extraction of Code Documentation;Proceedings of the 8th International Conference on Sustainable Information Engineering and Technology;2023-10-24

5. Semantic-Enriched Code Knowledge Graph to Reveal Unknowns in Smart Contract Code Reuse;ACM Transactions on Software Engineering and Methodology;2023-09-30