Research and Application of Clustering Algorithm for Text Big Data

Published:2022-06-08 Volume:2022 Page:1-8
ISSN:1687-5273
Container-title:Computational Intelligence and Neuroscience
language:en

Author:

Chen Zi Li¹

Affiliation:

1. Institute of General Aviation Industry, Fujian Chuanzheng Communications College, Fuzhou 350007, China

Abstract

In the era of big data, text is an important information store in all walks of life. From humanities research to government decision-making, from precision medicine to quantitative finance, and from customer management to marketing, massive text, as one of the most important information carriers, plays an important role everywhere. The text data generated in practical problems in humanities research, finance, marketing, and other fields often has obvious domain characteristics: it contains the professional vocabulary and distinctive language patterns of those fields and is often accompanied by a variety of “noise.” Processing such text is a great challenge under current technical conditions, especially for Chinese text. Clustering algorithms offer a better solution for processing text big data. Clustering algorithms are the core of cluster analysis. The K-means algorithm is widely used in cluster analysis because its implementation principle is simple and its time complexity is low, but its K value must be preset, and its randomly selected initial cluster centers can trap it in a locally optimal solution. Other clustering algorithms, such as mean shift clustering, are also considered alongside K-means clustering in mining text big data. In view of these problems, this paper first extracts and analyzes text big data and then runs experiments with the clustering algorithms. The experiments, which analyze large-scale text data and are limited to large-scale and simple data sets, show that the traditional K-means algorithm has low efficiency and reduced accuracy and is susceptible to the choice of initial centers and to abnormal data. To address these problems, the K-means cluster analysis algorithm is analyzed and improved for data sets with large data volumes, to raise its execution efficiency and accuracy on such data sets.
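The two K-means limitations named in the abstract (a preset K and randomly chosen initial centers that can end in a local optimum) can be illustrated with a minimal sketch of standard Lloyd-style K-means in plain Python. This is not the improved algorithm the paper proposes. The function names `assign`, `kmeans`, and `best_of_restarts`, the `init` parameter, and the random-restart strategy are assumptions made here for illustration.

```python
import math
import random


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain Lloyd-style K-means; returns (centers, labels, inertia)."""
    rng = random.Random(seed)
    # K is preset by the caller; initial centers are random unless `init` is given.
    start = init if init is not None else rng.sample(points, k)
    centers = [tuple(c) for c in start]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the center to the mean of its assigned points.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep old center
        if new_centers == centers:  # converged, possibly to a local optimum
            break
        centers = new_centers
    labels = assign(points, centers)
    # Inertia: sum of squared distances from each point to its center.
    inertia = sum(math.dist(p, centers[lab]) ** 2
                  for p, lab in zip(points, labels))
    return centers, labels, inertia


def best_of_restarts(points, k, n_restarts=10):
    """Run K-means from several random starts; keep the lowest-inertia run."""
    runs = (kmeans(points, k, seed=s) for s in range(n_restarts))
    return min(runs, key=lambda run: run[2])
```

Points are expected as a list of equal-length tuples. Keeping the best of several random restarts is one common way to reduce the dependence on the initial centers; it does not remove the need to choose K in advance.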
Mean shift clustering can be regarded as making many random centers move towards the direction of maximum density gradually, that is, moving their mean centroid continuously according to the probability density of data and finally obtaining multiple maximum density centers. It can also be said that mean shift clustering is a kernel density estimation algorithm.
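As a rough illustration of the description above, the following is a minimal mean shift sketch in plain Python. It assumes a Gaussian kernel, which the abstract does not specify. The function name `mean_shift` and the parameters `bandwidth`, `n_seeds`, `tol`, and the merge threshold of half the bandwidth are illustrative choices, not taken from the paper.

```python
import math
import random


def mean_shift(points, bandwidth=1.0, n_seeds=20, max_iter=100, tol=1e-4, seed=0):
    """Shift randomly chosen seed centers toward local density maxima.

    Assumption: a Gaussian kernel with the given bandwidth.
    """
    rng = random.Random(seed)
    # Start from randomly chosen data points ("many random centers").
    centers = [tuple(p) for p in rng.sample(points, min(n_seeds, len(points)))]
    dim = len(points[0])
    for _ in range(max_iter):
        max_shift = 0.0
        new_centers = []
        for c in centers:
            # Gaussian kernel weight of every data point as seen from c.
            weights = [
                math.exp(-math.dist(p, c) ** 2 / (2 * bandwidth ** 2))
                for p in points
            ]
            total = sum(weights)
            # The kernel-weighted mean is the center's new position.
            new_c = tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            )
            max_shift = max(max_shift, math.dist(c, new_c))
            new_centers.append(new_c)
        centers = new_centers
        if max_shift < tol:  # every center has (nearly) stopped moving
            break
    # Centers that ended up close together are treated as one density peak.
    modes = []
    for c in centers:
        if all(math.dist(c, m) > bandwidth / 2 for m in modes):
            modes.append(c)
    return modes


if __name__ == "__main__":
    data_rng = random.Random(42)
    blob_a = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(50)]
    blob_b = [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(50)]
    print(mean_shift(blob_a + blob_b, bandwidth=1.0))
```

Unlike K-means, this sketch takes no cluster count: the number of returned modes depends on the data and on `bandwidth`, which controls how strongly the density estimate is smoothed.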

Publisher

Hindawi Limited

Subject

General Mathematics, General Medicine, General Neuroscience, General Computer Science

Link

http://downloads.hindawi.com/journals/cin/2022/7042778.pdf

Cited by 5 articles.

1. Analysis and prediction of research hotspots and trends in heart failure research;J TRANSL INTERN MED;2024

2. MATHEMATICAL METHODS IN CYBER SECURITY: CLUSTER ANALYSIS AND ITS APPLICATION IN INFORMATION AND CYBERNETIC SECURITY;Cybersecurity: Education, Science, Technique;2024

3. Analysis and prediction of research hotspots and trends in pediatric medicine from 2,580,642 studies published between 1940 and 2021;World Journal of Pediatrics;2023-06-09

4. Recurrence Risk Evaluation in Patients with Papillary Thyroid Carcinoma: Multicenter Machine Learning Evaluation of Lymph Node Variables;Cancers;2023-01-16

5. Enhanced Mean Load Based Clustering Technique On Dented Image Segments In Reconstruction Of Buildings;2022 3rd International Conference on Communication, Computing and Industry 4.0 (C2I4);2022-12-15