JEMMA: An extensible Java dataset for ML4Code applications

Published:2023-03 Issue:2 Volume:28 Page:
ISSN:1382-3256
Container-title:Empirical Software Engineering
language:en
Short-container-title:Empir Software Eng

Author:

Karmakar Anjan, Allamanis Miltiadis, Robbes Romain

Abstract

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code’s richly structured information. With this in mind, we introduce JEMMA: An Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible, allowing users to add new properties and representations to the dataset and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.

Funder

Libera Università di Bolzano

Publisher

Springer Science and Business Media LLC

Subject

Software

Link

https://link.springer.com/content/pdf/10.1007/s10664-022-10275-7.pdf

References (95 articles)

1. Ahmad W, Chakraborty S, Ray B, Chang K W (2021) Unified pre-training for program understanding and generation. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.211

2. Allamanis M (2019) The adverse effects of code duplication in machine learning models of code. In: Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp 143–153

3. Allamanis M, Sutton C (2013) Mining source code repositories at massive scale using language modeling. In: 2013 10th Working Conference on Mining Software Repositories (MSR), IEEE, pp 207–216

4. Allamanis M, Brockschmidt M, Khademi M (2017) Learning to represent programs with graphs. arXiv:1711.00740

5. Allamanis M, Barr E T, Devanbu P, Sutton C (2018) A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51(4):1–37

Cited by 2 articles.

1. Out of Context: How important is Local Context in Neural Program Repair?;Proceedings of the IEEE/ACM 46th International Conference on Software Engineering;2024-04-12

2. INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers;IEEE Transactions on Software Engineering;2023