Code stylometry vs formatting and minification

Published:2024-09-06 Volume:10 Page:e2142
ISSN:2376-5992
Container-title:PeerJ Computer Science
language:en

Author:

Balla Stefano¹, Gabbrielli Maurizio¹, Zacchiroli Stefano²

Affiliation:

1. DISI, University of Bologna, Bologna, Italy

2. LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

Abstract

The automatic identification of code authors based on their programming styles, known as authorship attribution or code stylometry, has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51%→68%), (2) code formatting makes a significant dent of 15 percentage points in code stylometry accuracy (68%→53%), with minification subtracting a further 3 percentage points (68%→50% overall). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
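
The gap between CST-based and AST-based stylometry reported in the abstract can be made intuitive with a small sketch. The following is not the authors' pipeline (no code2vec, Black, or Python Minifier is involved); it only assumes Python's standard `ast` and `tokenize` modules, uses the raw token stream as a rough stand-in for a concrete syntax tree, and works on two hypothetical snippets invented for illustration. It suggests why layout and comments, which a formatter or minifier rewrites, are visible to a concrete representation but absent from an abstract one.

```python
import ast
import io
import tokenize

# Two hypothetical snippets (not from the dataset): the same program,
# written with different spacing and with/without a comment.
ORIGINAL = "def add( a,b ):\n    # sum two values\n    return a+b\n"
REFORMATTED = "def add(a, b):\n    return a + b\n"


def ast_fingerprint(source: str) -> str:
    """Dump the abstract syntax tree; spacing and comments are not part of it."""
    return ast.dump(ast.parse(source))


def surface_tokens(source: str) -> list:
    """List (token type, text, start position) for every token, comments included.

    Assumption: the token stream is used here as a crude proxy for a CST.
    """
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string, tok.start)
        for tok in tokenize.generate_tokens(readline)
    ]


# The abstract view cannot tell the two versions apart ...
same_ast = ast_fingerprint(ORIGINAL) == ast_fingerprint(REFORMATTED)
# ... while the concrete view still sees the stylistic differences.
same_surface = surface_tokens(ORIGINAL) == surface_tokens(REFORMATTED)

print("identical AST:", same_ast)
print("identical surface tokens:", same_surface)
```

In this toy case the two versions share one AST but differ in their surface tokens, which is consistent with the abstract's observation that concrete representations carry extra stylistic signal and that normalizing layout removes part of it.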

Publisher

PeerJ

Link

https://peerj.com/articles/cs-2142.pdf
