Abstract
Machine learning for source code is an active research field in which extensive experimentation is needed to discover how best to use the richly structured information in source code. With this in mind, we introduce an extensible Java dataset for applications in this field: a large-scale, diverse, and high-quality dataset. Our goal is to lower the barrier to entry by providing the building blocks needed to experiment with source code models and tasks. The dataset comes with a considerable amount of pre-processed information, including metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results), for 50,000 Java projects drawn from an existing dataset, with over 1.2 million classes and over 8 million methods. It is also extensible, allowing users to add new properties and representations and to evaluate tasks on them. The dataset thus becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate its utility, we also report results from two empirical studies on our data, which show that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that the dataset is designed to help with.
Funder
Libera Università di Bolzano
Publisher
Springer Science and Business Media LLC
Cited by
2 articles.
1. Out of Context: How important is Local Context in Neural Program Repair? Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024-04-12.
2. INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers. IEEE Transactions on Software Engineering, 2023.