Automating Feature Extraction from Entity-Relation Models: Experimental Evaluation of Machine Learning Methods for Relational Learning

Published:2024-04-01 Issue:4 Volume:8 Page:39
ISSN:2504-2289
Container-title:Big Data and Cognitive Computing
language:en
Short-container-title:BDCC

Author:

Stanoev Boris¹², Mitrov Goran¹², Kulakov Andrea¹, Mirceva Georgina¹, Lameski Petre¹², Zdravevski Eftim¹²

Affiliation:

1. Faculty of Computer Science and Engineering, Ss Cyril and Methodius University, 1000 Skopje, North Macedonia

2. Magix.AI, 1000 Skopje, North Macedonia

Abstract

With the exponential growth of data, extracting actionable insights becomes resource-intensive. In many organizations, normalized relational databases store a significant portion of this data, where tables are interconnected through some relations. This paper explores relational learning, which involves joining and merging database tables, often normalized in the third normal form. The subsequent processing includes extracting features and utilizing them in machine learning (ML) models. In this paper, we experiment with the propositionalization algorithm (i.e., Wordification) for feature engineering. Next, we compare the algorithms PropDRM and PropStar, which are designed explicitly for multi-relational data mining, to traditional machine learning algorithms. Based on the performed experiments, we concluded that Gradient Boost, compared to PropDRM, achieves similar performance (F1 score, accuracy, and AUC) on multiple datasets. PropStar consistently underperformed on some datasets while being comparable to the other algorithms on others. In summary, the propositionalization algorithm for feature extraction makes it feasible to apply traditional ML algorithms for relational learning directly. In contrast, approaches tailored specifically for relational learning still face challenges in scalability, interpretability, and efficiency. These findings have a practical impact that can help speed up the adoption of machine learning in business contexts where data is stored in relational format without requiring domain-specific feature extraction.
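The propositionalization idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical two-table schema (`customers` and `orders`, linked by `customer_id`), and all table, column, and function names are invented for illustration. Each row of the main table becomes a bag of `table_column_value` "words", which is then turned into a flat count vector that a traditional classifier could consume. This is a simplified Wordification-style transformation, not the implementation evaluated in the paper, and it omits refinements such as term weighting.

```python
from collections import Counter

# Hypothetical toy schema: a main table and one secondary table linked by a foreign key.
customers = [
    {"id": 1, "segment": "retail", "label": 1},
    {"id": 2, "segment": "business", "label": 0},
]
orders = [
    {"customer_id": 1, "category": "books"},
    {"customer_id": 1, "category": "games"},
    {"customer_id": 2, "category": "books"},
]


def wordify(main_rows, related_rows, main_name, related_name, key, fk, skip=("label",)):
    """Turn each main-table row into a bag of 'table_column_value' words."""
    docs = {}
    for row in main_rows:
        # Words from the main table's own attributes (key and target excluded).
        words = [
            f"{main_name}_{col}_{val}"
            for col, val in row.items()
            if col != key and col not in skip
        ]
        # Words from every related row that points at this main-table row.
        for rel in related_rows:
            if rel[fk] == row[key]:
                words += [
                    f"{related_name}_{col}_{val}"
                    for col, val in rel.items()
                    if col != fk
                ]
        docs[row[key]] = Counter(words)
    return docs


def to_matrix(docs):
    """Convert the bags of words into count vectors over a shared, sorted vocabulary."""
    vocab = sorted({word for bag in docs.values() for word in bag})
    rows = {k: [bag[word] for word in vocab] for k, bag in docs.items()}
    return vocab, rows


docs = wordify(customers, orders, "customers", "orders", key="id", fk="customer_id")
vocab, matrix = to_matrix(docs)
print(vocab)
print(matrix)
```

The resulting rows of `matrix`, paired with the `label` column, form an ordinary single-table dataset, which is what makes it feasible to apply a standard learner such as gradient boosting to data that originally spanned several tables.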

Funder

Faculty of Computer Science and Engineering at the Ss. Cyril and Methodius University in Skopje, Macedonia

Publisher

MDPI AG

Link

https://www.mdpi.com/2504-2289/8/4/39/pdf

References: 34 articles.

1. Cost Optimization for Big Data Workloads Based on Dynamic Scheduling and Cluster-Size Tuning;Grzegorowski;Big Data Res.,2021

2. Zdravevski, E., Lameski, P., Dimitrievski, A., Grzegorowski, M., and Apanowicz, C. (2019, January 9–12). Cluster-size optimization within a cloud-based ETL framework for Big Data. Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA.

3. Zdravevski, E., Lameski, P., Kulakov, A., Jakimovski, B., Filiposka, S., and Trajanov, D. (2015, January 20–22). Feature Ranking Based on Information Gain for Large Classification Problems with MapReduce. Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, Finland.

4. Ziarko, W.P. (1994). Proceedings of the Rough Sets, Fuzzy Sets and Knowledge Discovery, Springer.

5. Džeroski, S., and Lavrač, N. (2001). Relational Data Mining, Springer.

Cited by 1 article.

1. Combining Semantic Matching, Word Embeddings, Transformers, and LLMs for Enhanced Document Ranking: Application in Systematic Reviews;Big Data and Cognitive Computing;2024-09-04