Information-based massive data retrieval method based on distributed decision tree algorithm

Published:2022-06-18 Issue:01 Volume:14 Page:
ISSN:1793-9623
Container-title:International Journal of Modeling, Simulation, and Scientific Computing
language:en
Short-container-title:Int. J. Model. Simul. Sci. Comput.

Author:

Chen Bin¹,Chen Qingming²,Ye Peishan²

Affiliation:

1. China Southern Power Grid Co., Ltd., Guangzhou, Guangdong 510663, P. R. China

2. China Southern Power Grid Digital Media Technology Co. Ltd., Guangzhou, Guangdong 510060, P. R. China

Abstract

This paper proposes an information-based massive data retrieval method for heterogeneous distributed environments, built on a distributed decision tree algorithm. First, the method vertically partitions the dataset and synchronously updates a hash table, and it improves the distributed decision tree algorithm with interval segmentation and interval filtering. The algorithm uses an attribute histogram data structure and merges the category list into each attribute list, which reduces the amount of data that must reside in memory. Second, under the strategy of vertical partitioning and synchronous hash table updates, the hash table entries to update are selected according to the minimum Gini value, the corresponding entries are modified, and the hash table is used to record and control each sub-site; node splitting under this scheme achieves a high accuracy rate. In addition, for classification problems that satisfy monotonic constraints in a distributed environment, the paper extends the idea of building a monotonic decision tree to the distributed setting: it supplements the distributed decision tree algorithm with a modification rule that converts a generated nonmonotonic decision tree into a monotonic one. To address the high load of the privacy-protected data stream classification mining algorithm on a single node, a parallel algorithm, PPFDT_P, based on the distributed decision tree algorithm is designed and implemented on the Storm platform.
The word vector model improves the deep representation of features and solves the problem of high-dimensional sparse features, while the iterative decision tree model (GBDT) is better suited to dense, non-high-dimensional features. The data retrieval application therefore combines GBDT with the word vector model, using the distributed representation of words (word vectors) to classify short messages with the GBDT model. Experimental results show that the distributed decision tree algorithm has high efficiency, good speed-up and good scalability, so the number of datasets at each sub-site never needs to be increased. A monotonic decision tree is obtained by inserting only a small number of data items and splitting some leaf nodes, which adds a small number of branches. Compared with other networks, the proposed system achieves a massive data ratio of 54.1%.
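The split-selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each sub-site holds one attribute column as a list of hypothetical `(value, label, record_id)` entries (the class label merged into the attribute list), and that a shared hash table maps each record id to the tree node it currently belongs to. The function names, the midpoint threshold, and the `2 * node_id + 1` child numbering are all illustrative assumptions.

```python
from collections import Counter


def gini(counts):
    """Gini impurity of a class-count histogram."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_local_split(attribute_list):
    """Find the best binary split 'value <= threshold' for one attribute.

    attribute_list holds (value, label, record_id) entries (assumed layout).
    Returns (weighted_gini, threshold), or None if no split is possible.
    """
    entries = sorted(attribute_list)
    n = len(entries)
    below = Counter()                                   # class histogram left of the cut
    above = Counter(label for _, label, _ in entries)   # class histogram right of the cut
    best = None
    for i in range(n - 1):
        value, label, _ = entries[i]
        below[label] += 1
        above[label] -= 1
        next_value = entries[i + 1][0]
        if value == next_value:
            continue  # cannot cut between equal values
        weighted = ((i + 1) * gini(below) + (n - i - 1) * gini(above)) / n
        if best is None or weighted < best[0]:
            best = (weighted, (value + next_value) / 2)
    return best


def split_node(sites, hash_table, node_id):
    """Coordinator step: pick the minimum-Gini split, then update the hash table.

    sites maps an attribute name to that sub-site's attribute list.
    hash_table maps record_id -> current node id and is modified in place.
    """
    candidates = {}
    for name, attribute_list in sites.items():
        local = [e for e in attribute_list if hash_table[e[2]] == node_id]
        result = best_local_split(local)
        if result is not None:
            candidates[name] = result
    if not candidates:
        return None
    winner = min(candidates, key=lambda name: candidates[name][0])
    threshold = candidates[winner][1]
    left, right = 2 * node_id + 1, 2 * node_id + 2  # assumed child numbering
    for value, _, record_id in sites[winner]:
        if hash_table[record_id] == node_id:
            hash_table[record_id] = left if value <= threshold else right
    return winner, threshold
```

In this sketch only the winning sub-site needs to scan its own attribute list to update the hash table; the other sub-sites can then look up each record's new node by record id, without exchanging attribute values.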

Funder

Research and demonstration application of key technologies of new base data center.

Publisher

World Scientific Pub Co Pte Ltd

Subject

Computer Science Applications,Modeling and Simulation,General Engineering,General Mathematics

Link

https://www.worldscientific.com/doi/pdf/10.1142/S1793962322430024

Reference 24 articles.

1. Design and implementation of bank CRM system based on decision tree algorithm

2. Unstructured big data analysis algorithm and simulation of Internet of Things based on machine learning

3. Internet of health things-driven deep learning system for detection and classification of cervical cells using transfer learning

4. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering;Li J.,2021

Cited by 4 articles.

1. Efficient retrieval of power structured data with global data access view;Proceedings of the 2023 International Conference on Communication Network and Machine Learning;2023-10-27

2. Parallel Algorithm of High Precision Surface Modeling Based on Differential Geometry;2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON);2023-08-05

3. ContextAD: Context-Aware Acronym Disambiguation with Siamese BERT Network;International Journal of Intelligent Systems;2023-07-29

4. Study on personalised search of English teaching resources database based on semantic association mining;International Journal of Computer Applications in Technology;2023