SCC-GPT: Source Code Classification Based on Generative Pre-Trained Transformers

Published: 2024-07-07 Issue: 13 Volume: 12 Page: 2128
ISSN: 2227-7390
Journal: Mathematics
Language: en

Author:

Alahmadi Mohammad D.¹, Alshangiti Moayad¹, Alsubhi Jumana²

Affiliation:

1. Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia

2. School of Computing, University of Georgia, Athens, GA 30602, USA

Abstract

Developers often rely on online resources, such as Stack Overflow (SO), to seek assistance for programming tasks. To facilitate effective search and resource discovery, manual tagging of questions and posts with the appropriate programming language is essential. However, accurate tagging is not consistently achieved, which creates a need for automated classification of code snippets into the correct programming language as a tag. In this study, we introduce a novel approach to automated classification of code snippets from SO posts into programming languages using generative pre-trained transformers (GPT). Our method, which requires neither additional training on labeled data nor dependence on pre-existing labels, classifies 224,107 code snippets into 19 programming languages. We employ the text-davinci-003 model from OpenAI's GPT-3.5 series and postprocess its responses to accurately identify the programming language. Our empirical evaluation demonstrates that our GPT-based model (SCC-GPT) significantly outperforms existing methods, achieving a median F1-score improvement that ranges from +6% to +31%. These findings underscore the effectiveness of SCC-GPT in enhancing code snippet classification, offering a cost-effective and efficient solution for developers who rely on SO for programming assistance.
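
The abstract mentions two steps that can be illustrated without access to the model itself: postprocessing a free-text response into a language label, and scoring predictions with a per-language F1. The following is a minimal sketch under stated assumptions, not the authors' implementation. The alias table, the function names `normalize_response` and `f1_for_label`, and the sample responses are all hypothetical; the abstract does not list the 19 languages or describe the actual postprocessing rules.

```python
import re
from typing import Optional, Sequence

# Hypothetical alias table (assumption): maps spellings a model might
# produce to one canonical label. The paper's real label set is not
# given in the abstract.
ALIASES = {
    "python": "Python", "py": "Python",
    "javascript": "JavaScript", "js": "JavaScript",
    "c++": "C++", "cpp": "C++",
    "c#": "C#", "csharp": "C#",
    "java": "Java",
}

# Longest aliases first so "javascript" is tried before "java".
# The lookarounds stop matches inside longer tokens (e.g. "py" in "happy").
_PATTERN = re.compile(
    "|".join(
        r"(?<![\w+#])" + re.escape(alias) + r"(?![\w+#])"
        for alias in sorted(ALIASES, key=len, reverse=True)
    ),
    re.IGNORECASE,
)


def normalize_response(response: str) -> Optional[str]:
    """Return the first known language named in a free-text response."""
    match = _PATTERN.search(response)
    return ALIASES[match.group(0).lower()] if match else None


def f1_for_label(y_true: Sequence, y_pred: Sequence, label: str) -> float:
    """Per-language F1: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up responses, for illustration only.
responses = [
    "This code is written in JavaScript.",
    "The language is C++.",
    "I am not sure.",
]
predictions = [normalize_response(r) for r in responses]
```

A response that names no known language maps to `None`, which counts as a miss for the true label when F1 is computed. How the paper handles such responses is not stated in the abstract.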

Funder

University of Jeddah, Jeddah, Saudi Arabia

Publisher

MDPI AG

Link

https://www.mdpi.com/2227-7390/12/13/2128/pdf

References (37 articles)

1. Stanley, C., and Byrne, M.D. (2013, January 11–14). Predicting tags for stackoverflow posts. Proceedings of the ICCM, Ottawa, ON, Canada.

2. Beyer, S., and Pinzger, M. (2015, January 18–19). Synonym suggestion for tags on stack overflow. Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, Florence, Italy.

3. Ye, D., Xing, Z., Li, J., and Kapre, N. (2016, January 4–8). Software-specific part-of-speech tagging: An experimental study on stack overflow. Proceedings of the 31st Annual ACM Symposium on Applied Computing, Pisa, Italy.

4. Yang, G., Zhou, Y., Yu, C., and Chen, X. (2021). DeepSCC: Source Code Classification Based on Fine-Tuned RoBERTa. arXiv.

5. Khasnabish, J.N., Sodhi, M., Deshmukh, J., and Srinivasaraghavan, G. (2014, January 21–24). Detecting programming language from source code using Bayesian learning techniques. Proceedings of the Machine Learning and Data Mining in Pattern Recognition: 10th International Conference, MLDM 2014, St. Petersburg, Russia. Proceedings 10.