Supervised Authorship Segmentation of Open Source Code Projects

Published:2021-07-23 Issue:4 Volume:2021 Page:464-479
ISSN:2299-0984
Container-title:Proceedings on Privacy Enhancing Technologies
language:en

Author:

Dauber Edwin¹,Erbacher Robert²,Shearer Gregory³,Weisman Michael²,Nelson Frederica²,Greenstadt Rachel⁴

Affiliation:

1. Drexel University

2. United States Army Research Laboratory

3. ICF International

4. New York University

Abstract

Source code authorship attribution can be used for many types of intelligence on binaries and executables, including forensics, but introduces a threat to the privacy of anonymous programmers. Previous work has shown how to attribute individually authored code files and code segments. In this work, we examine authorship segmentation, in which we determine authorship of arbitrary parts of a program. While previous work has performed segmentation at the textual level, we attempt to attribute subtrees of the abstract syntax tree (AST). We focus on two primary problems: identifying the primary author of an arbitrary AST subtree and identifying on which edges of the AST primary authorship changes. We demonstrate that the former is a difficult problem but the latter is much easier. We also demonstrate methods by which we can leverage the easier problem to improve accuracy for the harder problem. We show that while identifying the author of subtrees is difficult overall, this is primarily due to the abundance of small subtrees: in the validation set we can attribute subtrees of at least 25 nodes with accuracy over 80% and at least 33 nodes with accuracy over 90%, while in the test set we can attribute subtrees of at least 33 nodes with accuracy of 70%. While our baseline accuracy for single AST nodes is 20.21% for the validation set and 35.66% for the test set, we present techniques by which we can increase this accuracy to 42.01% and 49.21% respectively. We further present observations about collaborative code found on GitHub that may drive further research.
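
The two problems named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses Python's standard `ast` module purely for convenience, assumes a hypothetical `LINE_AUTHORS` map from line number to author (for example, derived from blame data), and assumes that a subtree's "primary author" is the majority author of the lines it spans. It only builds ground-truth labels for a toy file; it does not train or evaluate any classifier.

```python
import ast
from collections import Counter

def subtree_size(node):
    """Number of AST nodes in the subtree rooted at `node`."""
    return sum(1 for _ in ast.walk(node))

def primary_author(node, line_authors):
    """Majority author over the lines a subtree spans (assumed labeling rule)."""
    if not hasattr(node, "lineno"):
        return None  # nodes without position info (e.g. Module, operators)
    lines = range(node.lineno, node.end_lineno + 1)
    counts = Counter(line_authors[n] for n in lines if n in line_authors)
    return counts.most_common(1)[0][0] if counts else None

def authorship_change_edges(tree, line_authors):
    """Parent-to-child edges where the primary author differs."""
    edges = []
    for parent in ast.walk(tree):
        p = primary_author(parent, line_authors)
        for child in ast.iter_child_nodes(parent):
            c = primary_author(child, line_authors)
            if p is not None and c is not None and p != c:
                edges.append((type(parent).__name__, type(child).__name__, p, c))
    return edges

# Toy file and hypothetical per-line authorship (line 3 is blank).
SOURCE = """def area(w, h):
    return w * h

def perimeter(w, h):
    total = w + h
    return 2 * total
"""
LINE_AUTHORS = {1: "alice", 2: "alice", 4: "bob", 5: "bob", 6: "alice"}

tree = ast.parse(SOURCE)
for func in tree.body:
    print(func.name, subtree_size(func), primary_author(func, LINE_AUTHORS))
print(authorship_change_edges(tree, LINE_AUTHORS))
```

Under these assumptions, a subtree's label depends on how many lines it covers, which gives one intuition for why small subtrees, with few nodes and little context, are reported as the hard case.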

Publisher

Walter de Gruyter GmbH

Subject

General Medicine

Link

https://www.sciendo.com/pdf/10.2478/popets-2021-0080

References (20 articles)

[1] Mohammed Abuhamad, Tamer AbuHmed, Aziz Mohaisen, and DaeHun Nyang. 2018. Large-Scale and Language-Oblivious Code Authorship Identification. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 101–114.

[2] Mohammed Abuhamad, Tamer Abuhmed, DaeHun Nyang, and David Mohaisen. 2020. Multi-χ: Identifying Multiple Authors from Source Code Files. Proceedings on Privacy Enhancing Technologies 1 (2020), 17.

[3] Alfred V Aho, Ravi Sethi, and Jeffrey D Ullman. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley.

[4] Navot Akiva and Moshe Koppel. 2012. Identifying distinct components of a multi-author document. In 2012 European Intelligence and Security Informatics Conference. IEEE, 205–209.

[5] Navot Akiva and Moshe Koppel. 2013. A generic unsupervised method for decomposing multi-author documents. Journal of the American Society for Information Science and Technology 64, 11 (2013), 2256–2264.

Cited by 1 article.

1. Distinguishing AI- and Human-Generated Code: A Case Study. Proceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses. 2023-11-26.