ModelSet: a dataset for machine learning in model-driven engineering

Author:

José Antonio Hernández López, Javier Luis Cánovas Izquierdo, Jesús Sánchez Cuadrado

Abstract

The application of machine learning (ML) algorithms to problems in model-driven engineering (MDE) is currently hindered by the lack of curated datasets of software models. There are several reasons for this, including the lack of large collections of good-quality models, the difficulty of labelling models due to the domain expertise required, and the relative immaturity of the application of ML to MDE. In this work, we present ModelSet, a labelled dataset of software models intended to enable the application of ML to software modelling problems. To create it, we have devised a method that facilitates the exploration and labelling of model datasets by interactively grouping similar models using off-the-shelf technologies such as a search engine. We have built an Eclipse plug-in to support the labelling process, which we have used to label 5,466 Ecore meta-models and 5,120 UML models, each with its category as the main label plus additional secondary labels of interest. We have evaluated the ability of our labelling method to create meaningful groups of models in order to speed up the process, improving the effectiveness of classical clustering methods. We showcase the usefulness of the dataset by applying it in a real scenario: enhancing the MAR search engine. We use ModelSet to train models able to infer useful metadata for navigating search results. The dataset and the tooling are available at https://figshare.com/s/5a6c02fa8ed20782935c, and a live version is available at http://modelset.github.io.

Funder

Ministerio de Educación y Cultura

Publisher

Springer Science and Business Media LLC

Subject

Modeling and Simulation, Software

Cited by 13 articles.

1. ModelSet: A labelled dataset of software models for machine learning;Science of Computer Programming;2024-01

2. Language usage analysis for EMF metamodels on GitHub;Empirical Software Engineering;2023-12-13

3. Measuring and Clustering Heterogeneous Chatbot Designs;ACM Transactions on Software Engineering and Methodology;2023-12-13

4. EA ModelSet – A FAIR Dataset for Machine Learning in Enterprise Modeling;Lecture Notes in Business Information Processing;2023-11-25

5. Word Embeddings for Model-Driven Engineering;2023 ACM/IEEE 26th International Conference on Model Driven Engineering Languages and Systems (MODELS);2023-10-01
