Abstract
Proper distribution of data items can substantially improve the performance of data processing in a distributed environment. However, typical data storage systems and distributed computational frameworks pay no special attention to this aspect. In this paper, the author introduces two custom data item addressing methods for distributed data storage, using the Scalable Distributed Two-Layer Datastore as an example. The basic idea of these methods is to ensure that data items stored on the same cluster node are similar to each other, following the concepts of data clustering. Most data clustering mechanisms, however, scale poorly with data size, which is a severe limitation in Big Data applications. The proposed methods distribute a data set efficiently over a set of buckets. Experimental results show that all proposed methods produce good results efficiently in comparison with traditional clustering techniques such as k-means, agglomerative clustering, and BIRCH. Experiments in a distributed environment show that proper data distribution can substantially improve the effectiveness of Big Data processing.
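The general idea of similarity-preserving addressing can be illustrated with a minimal sketch. This is not the author's method, which the abstract does not specify; it assumes a fixed list of hypothetical reference points (`anchors`), one per bucket, and assigns each item to the bucket whose anchor is nearest in Euclidean distance. The function names `nearest_bucket` and `distribute` are illustrative only.

```python
import math


def nearest_bucket(item, anchors):
    """Return the index of the anchor closest to item (Euclidean distance)."""
    return min(range(len(anchors)), key=lambda i: math.dist(item, anchors[i]))


def distribute(items, anchors):
    """Group items into buckets so that each bucket holds mutually similar items."""
    buckets = {i: [] for i in range(len(anchors))}
    for item in items:
        buckets[nearest_bucket(item, anchors)].append(item)
    return buckets


# Hypothetical data: two anchors (one per bucket) and four 2-D items.
anchors = [(0.0, 0.0), (10.0, 10.0)]
items = [(0.5, 1.0), (9.0, 11.0), (1.5, 0.2), (10.5, 9.5)]
buckets = distribute(items, anchors)
```

Because each item's bucket depends only on the item and the fixed anchors, the address can be computed without scanning the rest of the data set, unlike iterative clustering such as k-means.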
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Theoretical Computer Science
Cited by
3 articles.
1. Ordination-based verification of feature selection in pattern evolution research;Intelligent Data Analysis;2023-10-12
2. Optimization Simulation of Big Data Analysis Model Based on K-means Algorithm;2023 International Conference on Networking, Informatics and Computing (ICNETIC);2023-05
3. Massive Natural Language Processing in Distributed Environment;Distributed Computing and Artificial Intelligence, Special Sessions I, 20th International Conference;2023