
Identification Author of Source Code by Machine Learning Methods

Published:2019-06-04 Issue:3 Volume:18 Page:742-766
ISSN:2078-9599
Container-title:SPIIRAS Proceedings
Short-container-title:Тр. СПИИРАН

Author:

Kurtukova Anna,Romanov Alexander

Abstract

The paper analyzes the problem of identifying the author of source code, which is of interest to researchers in information security, computer forensics, assessment of the quality of the educational process, and intellectual property protection. It presents a detailed analysis of modern solutions to the problem. The authors propose two new identification techniques based on machine learning: one uses a support vector machine with a fast correlation filter and informative features, and the other uses a hybrid convolutional recurrent neural network. The experimental dataset includes source code samples written in Java, C++, Python, PHP, JavaScript, C, C#, and Ruby. The data was obtained from GitHub, a web service for hosting IT projects. The dataset contains more than 150 thousand source code samples with an average length of 850 characters, written by 542 authors. The experiments were conducted with source code written in the most popular programming languages. The accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with 2 to 50 authors for Java, the most popular programming language. Graphs of the relationship between identification accuracy and the number of authors are plotted. The analysis of the results showed that the method based on the hybrid neural network achieves 97% accuracy, which is currently the best-known result. The technique based on the support vector machine achieved 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.
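The 10-fold cross-validation protocol mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the classifier here is a simple nearest-profile matcher over character trigrams, swapped in for the support vector machine and neural network so that the example needs only the standard library. The function names (`char_ngrams`, `cosine`, `k_fold_accuracy`) and the trigram features are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict
from math import sqrt


def char_ngrams(code, n=3):
    """Count overlapping character n-grams in a source-code string."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def k_fold_accuracy(samples, k=10, seed=0):
    """Mean accuracy over k folds.

    samples: list of (source_code, author) pairs, with at least k items.
    The classifier is a hypothetical stand-in, not the paper's SVM or
    hybrid neural network.
    """
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    folds = [samples[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        # "Training": sum trigram counts per author over the other k-1 folds.
        profiles = defaultdict(Counter)
        for j, fold in enumerate(folds):
            if j != i:
                for code, author in fold:
                    profiles[author].update(char_ngrams(code))
        # Testing: predict the author whose profile is most similar.
        correct = 0
        for code, author in test_fold:
            features = char_ngrams(code)
            predicted = max(profiles, key=lambda a: cosine(features, profiles[a]))
            correct += predicted == author
        scores.append(correct / len(test_fold))
    return sum(scores) / k
```

Each sample is used for testing exactly once and for training in the other nine folds, so the reported figure is an average over ten held-out evaluations rather than a single train/test split.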

Publisher

SPIIRAS

Subject

Artificial Intelligence,Computer Networks and Communications,Control and Systems Engineering,Applied Mathematics

Cited by 7 articles.

1. Authorship Identification of Binary and Disassembled Codes Using NLP Methods;Information;2023-06-25

2. Analysis of Source Code Authorship Attribution Problem;2022 International Conference on Computers and Artificial Intelligence Technologies (CAIT);2022-11-04

3. Complex Cases of Source Code Authorship Identification Using a Hybrid Deep Neural Network;Future Internet;2022-09-30

4. Source code authorship attribution using file embeddings;Companion Proceedings of the 2021 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity;2021-10-17

5. Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks;Future Internet;2020-12-25