Exploring the Impact of Vocabulary Techniques on Code Completion: A Comparative Approach
Published:2024-01-13
Issue:
Volume:
Page:1-23
ISSN:0218-1940
Container-title:International Journal of Software Engineering and Knowledge Engineering
language:en
Short-container-title:Int. J. Soft. Eng. Knowl. Eng.
Author:
Hussain Yasir (1),
Huang Zhiqiu (1),
Zhou Yu (1),
Khan Izhar Ahmed (1)
Affiliation:
1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, Jiangsu 211106, P. R. China
Abstract
Integrated Development Environments (IDEs) are pivotal to productivity in modern software development, offering features such as code completion. Recent advances in Natural Language Processing (NLP) have enabled neural language models for code completion. In this study, we present an extensive investigation of how open- and closed-vocabulary systems affect the task of code completion. Specifically, we compare open- and closed-vocabulary systems across various vocabulary sizes to observe their impact on code completion performance. We experiment with three open-vocabulary systems, byte pair encoding (BPE), WordPiece and Unigram, and compare their modeling performance with that of closed-vocabulary systems. We also conduct experiments with different context sizes to study their impact on code completion performance. Our experiments cover several prominent language models: one recurrent neural network and five transformers. Our results indicate that vocabulary size significantly impacts modeling performance and can artificially boost the accuracy of code completion models, especially in the case of a closed-vocabulary system. Moreover, we find that different vocabulary systems have varying impacts on token coverage, with open-vocabulary systems exhibiting better token coverage. Our findings offer valuable insights for building effective code completion models, aiding researchers and practitioners in this field.
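To make the notion of token coverage concrete, the following is a minimal sketch, not the authors' experimental setup. It assumes a closed vocabulary built from the most frequent training tokens, with every other token treated as out-of-vocabulary. The function names and the toy corpora are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_closed_vocab(train_tokens, size):
    """Keep the `size` most frequent training tokens (hypothetical helper).

    In a closed-vocabulary system every token outside this set would be
    mapped to a single unknown-token placeholder.
    """
    counts = Counter(train_tokens)
    return {tok for tok, _ in counts.most_common(size)}


def token_coverage(vocab, test_tokens):
    """Fraction of test tokens that the vocabulary can represent directly."""
    if not test_tokens:
        return 0.0
    covered = sum(1 for tok in test_tokens if tok in vocab)
    return covered / len(test_tokens)


# Toy corpora, invented for illustration only.
train = "int i = 0 ; int j = 0 ; int count = i + j ;".split()
test = "int total = i + count ;".split()

for size in (3, 6, 8):
    vocab = build_closed_vocab(train, size)
    print(f"vocab size {size}: coverage {token_coverage(vocab, test):.2f}")
```

In this toy setting coverage rises with vocabulary size, but the identifier `total` never appears in training and so stays uncovered at any size. An open-vocabulary system such as BPE, WordPiece or Unigram would instead split such a token into smaller known units, which is one way to read the better token coverage reported in the abstract.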
Funder
National Natural Science Foundation of China
Natural Science Foundation of Jiangsu Province
Publisher
World Scientific Pub Co Pte Ltd
Subject
Artificial Intelligence, Computer Graphics and Computer-Aided Design, Computer Networks and Communications, Software