Abstract
Smart contracts, which execute automatically on decentralized platforms such as Ethereum, demand high security and low gas consumption. As a result, developers have a strong need for semantic code search tools that use natural language queries to efficiently retrieve existing code snippets. However, existing code search models face a semantic gap between code and queries, and bridging it typically requires a large amount of training data. In this paper, we propose a fine-tuning approach to bridge the semantic gap in code search and improve search accuracy. We collect 80,723 distinct <comment, code snippet> pairs from Etherscan.io and use them to fine-tune, validate, and test the pre-trained CodeBERT model. Using the fine-tuned model, we build a code search engine specifically for smart contracts. We evaluate the Recall@k and Mean Reciprocal Rank (MRR) of the fine-tuned CodeBERT model using different proportions of the fine-tuning data. Encouragingly, even a small amount of fine-tuning data produces satisfactory results. In addition, we perform a comparative analysis between the fine-tuned CodeBERT model and two state-of-the-art models. The experimental results show that the fine-tuned CodeBERT model achieves superior performance in terms of Recall@k and MRR. These findings highlight the effectiveness of our fine-tuning approach and its potential to significantly improve code search accuracy.
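To make the retrieval and evaluation setup concrete, the sketch below shows one plausible way to implement the components the abstract describes: encoding queries and contract code with a CodeBERT checkpoint, ranking snippets by cosine similarity, and computing Recall@k and MRR over a test set. This is a minimal illustration under assumptions, not the paper's implementation: the abstract does not specify the pooling strategy, similarity function, or fine-tuning objective, and `microsoft/codebert-base` stands in for the model actually fine-tuned on the Etherscan.io pairs.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the paper fine-tunes pre-trained CodeBERT on
# <comment, code snippet> pairs collected from Etherscan.io.
MODEL_NAME = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Mean-pool the last hidden states into one vector per input text
    (pooling strategy is an assumption, not stated in the abstract)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, H)

def rank_of_match(query, snippets, gold_index):
    """1-based rank of the correct snippet for a natural-language query,
    scoring candidates by cosine similarity to the query embedding."""
    q = torch.nn.functional.normalize(embed([query]), dim=-1)
    c = torch.nn.functional.normalize(embed(snippets), dim=-1)
    scores = (q @ c.T).squeeze(0)
    order = scores.argsort(descending=True)
    return (order == gold_index).nonzero().item() + 1

def recall_at_k(ranks, k):
    """Fraction of queries whose correct snippet appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank of the correct snippet."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

In this bi-encoder formulation, each <comment, code snippet> test pair yields one rank; aggregating the ranks gives the Recall@k and MRR figures the paper reports, and the effect of fine-tuning would appear as those ranks moving toward 1.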