SCC-GPT: Source Code Classification Based on Generative Pre-Trained Transformers

Author:

Alahmadi Mohammad D. 1, Alshangiti Moayad 1, Alsubhi Jumana 2

Affiliation:

1. Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia

2. School of Computing, University of Georgia, Athens, GA 30602, USA

Abstract

Developers often rely on online resources such as Stack Overflow (SO) for help with programming tasks. Effective search and resource discovery depend on questions and posts being manually tagged with the appropriate programming language. Such tagging is not consistently accurate, however, which motivates the automated classification of code snippets into the correct programming language as a tag. In this study, we introduce a novel approach that uses generative pre-trained transformers (GPT) to classify code snippets from SO posts into programming languages. Our method, which requires neither additional training on labeled data nor pre-existing labels, classifies 224,107 code snippets into 19 programming languages. We employ the text-davinci-003 model of ChatGPT-3.5 and postprocess its responses to accurately identify the programming language. Our empirical evaluation demonstrates that our GPT-based model (SCC-GPT) significantly outperforms existing methods, achieving a median F1-score improvement that ranges from +6% to +31%. These findings underscore the effectiveness of SCC-GPT in enhancing code snippet classification, offering a cost-effective and efficient solution for developers who rely on SO for programming assistance.
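The abstract mentions postprocessing the model's responses to identify the programming language. A minimal sketch of what such a step could look like is given below; it is not the authors' implementation. The alias table, the `normalize_label` function name, and the matching rule are all assumptions made for illustration, and the abstract does not list the 19 languages, so only a handful of hypothetical labels appear here.

```python
import re
from typing import Optional

# Hypothetical alias table (assumption): the abstract does not list the
# 19 target languages, so these entries are illustrative only.
ALIASES = {
    "python": "Python",
    "py": "Python",
    "javascript": "JavaScript",
    "js": "JavaScript",
    "java": "Java",
    "c++": "C++",
    "cpp": "C++",
    "c#": "C#",
    "csharp": "C#",
}


def normalize_label(response: str) -> Optional[str]:
    """Map a free-text model reply to a canonical label, or None if no alias is found."""
    text = response.strip().lower()
    # Check longer aliases first; the boundary guards below also stop
    # "java" from matching inside "javascript".
    for alias in sorted(ALIASES, key=len, reverse=True):
        pattern = r"(?<![a-z0-9+#])" + re.escape(alias) + r"(?![a-z0-9+#])"
        if re.search(pattern, text):
            return ALIASES[alias]
    return None


if __name__ == "__main__":
    for reply in ["The language is JavaScript.", "C++", "I am not sure."]:
        print(repr(reply), "->", normalize_label(reply))
```

This sketch returns the longest alias that matches, which is a simplification: a reply that names several languages would need an explicit tie-breaking rule, and replies that match nothing are returned as `None` so they can be counted separately.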

Funder

University of Jeddah, Jeddah, Saudi Arabia

Publisher

MDPI AG

