Author:
López José Antonio Hernández, Cánovas Izquierdo Javier Luis, Cuadrado Jesús Sánchez
Abstract
The application of machine learning (ML) algorithms to problems in model-driven engineering (MDE) is currently hindered by the lack of curated datasets of software models. There are several reasons for this, including the lack of large collections of good-quality models, the difficulty of labelling models due to the domain expertise required, and the relative immaturity of the application of ML to MDE. In this work, we present ModelSet, a labelled dataset of software models intended to enable the application of ML to software modelling problems. To create it, we have devised a method that facilitates the exploration and labelling of model datasets by interactively grouping similar models using off-the-shelf technologies such as a search engine. We have built an Eclipse plug-in to support the labelling process, and we have used it to label 5,466 Ecore meta-models and 5,120 UML models, each with its category as the main label plus additional secondary labels of interest. We have evaluated the ability of our labelling method to create meaningful groups of models in order to speed up the process, improving on the effectiveness of classical clustering methods. We showcase the usefulness of the dataset by applying it in a real scenario: enhancing the MAR search engine. We use ModelSet to train models able to infer useful metadata for navigating search results. The dataset and the tooling are available at https://figshare.com/s/5a6c02fa8ed20782935c and a live version at http://modelset.github.io.
Funder
Ministerio de Educación y Cultura
Publisher
Springer Science and Business Media LLC
Subject
Modeling and Simulation, Software
Cited by
23 articles.
1. Teaching UML using a RAG-based LLM;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30
2. Deriving Domain Models From User Stories: Human vs. Machines;2024 IEEE 32nd International Requirements Engineering Conference (RE);2024-06-24
3. ModelXGlue: a benchmarking framework for ML tools in MDE;Software and Systems Modeling;2024-06-10
4. Accelerating similarity-based model matching using dual hashing;Software and Systems Modeling;2024-04-29
5. Automated detection of class diagram smells using self-supervised learning;Automated Software Engineering;2024-03-24