Author:
Eddamiri Siham, Benghabrit Asmaa, Zemmouri Elmoukhtar
Abstract
Purpose: The purpose of this paper is to present a generic pipeline for Resource Description Framework (RDF) graph mining and to provide a comprehensive review of each step in the knowledge discovery from data process. The authors also investigate different approaches and combinations for extracting feature vectors from RDF graphs, which are then used for the clustering and theme identification tasks.

Design/methodology/approach: The proposed methodology comprises four steps. First, the authors generate several graph substructures (Walks, Set of Walks, Walks with backward and Set of Walks with backward). Second, the authors build neural language models to extract numerical vectors from the generated sequences by using word embedding techniques (Word2Vec and Doc2Vec) combined with term frequency-inverse document frequency (TF-IDF). Third, the authors use the well-known K-means algorithm to cluster the RDF graph. Finally, the authors extract the most relevant rdf:type from the grouped vertices and generate labels that describe the semantics of each theme.

Findings: The experimental evaluation on state-of-the-art data sets (AIFB, BGS and Conference) shows that the combination of Set of Walks with backward with the TF-IDF and Doc2Vec techniques gives excellent results. The clustering results reach more than 97% in terms of purity and more than 90% in terms of F-measure. For theme identification, the same combination reaches more than 90% on both the purity and F-measure criteria for all the considered data sets.

Originality/value: The originality of this paper lies in two aspects: first, a new machine learning pipeline for RDF data is presented; second, an efficient process to identify and extract relevant graph substructures from an RDF graph is proposed. The proposed techniques were combined with different neural language models to improve the accuracy and relevance of the feature vectors that are fed to the clustering mechanism.
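The first step of the methodology (walk generation, with and without backward edges) and the purity criterion used in the evaluation can be illustrated with a minimal sketch. This is not the authors' implementation: the toy triples, the `^` prefix that marks a reversed edge, and the function names `build_adjacency`, `walks_from` and `purity` are all assumptions made for illustration, and the embedding and K-means steps are left out.

```python
from collections import Counter, defaultdict

# Toy RDF graph as (subject, predicate, object) triples.
# All identifiers here are hypothetical, not taken from AIFB, BGS or Conference.
TRIPLES = [
    ("ex:alice", "ex:memberOf", "ex:groupA"),
    ("ex:bob", "ex:memberOf", "ex:groupA"),
    ("ex:alice", "ex:authorOf", "ex:paper1"),
    ("ex:groupA", "ex:partOf", "ex:institute"),
]


def build_adjacency(triples, with_backward=False):
    """Map each vertex to its outgoing (predicate, neighbour) edges.

    With with_backward=True every triple can also be traversed from object
    to subject; the reversed predicate gets a '^' prefix (assumed convention).
    """
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
        if with_backward:
            adj[o].append(("^" + p, s))
    return adj


def walks_from(root, adj, depth):
    """Enumerate walks starting at root with at most `depth` edges.

    A walk stops early when it reaches a vertex without outgoing edges.
    Each walk is a tuple: vertex, predicate, vertex, predicate, ...
    """
    walks = []

    def extend(path, vertex, remaining):
        edges = adj.get(vertex, [])
        if remaining == 0 or not edges:
            walks.append(tuple(path))
            return
        for predicate, neighbour in edges:
            extend(path + [predicate, neighbour], neighbour, remaining - 1)

    extend([root], root, depth)
    return walks


def purity(cluster_ids, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster].append(label)
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in by_cluster.values())
    return majority / len(class_labels)


if __name__ == "__main__":
    forward = build_adjacency(TRIPLES)
    backward = build_adjacency(TRIPLES, with_backward=True)
    for walk in walks_from("ex:alice", backward, depth=2):
        print(" -> ".join(walk))
    print(len(walks_from("ex:alice", forward, depth=2)), "forward walks")
```

In this toy graph, backward edges let a walk that starts at `ex:alice` reach `ex:bob` through their shared group, which forward-only walks cannot do. That extra context is one plausible reason why the with-backward variants could yield more informative sequences for the embedding step.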
Subject
Computer Networks and Communications, Information Systems
Cited by
2 articles.
1. Mining Electronic Health Records of Patients Using Linked Data for Ranking Diseases;EAI/Springer Innovations in Communication and Computing;2021-05-06
2. Theme Identification for RDF Graphs Based on LSTM Neural Reccurent Network;Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021);2021