WEClustering: word embeddings based text clustering technique for large datasets

Published:2021-09-07 Issue:6 Volume:7 Page:3211-3224
ISSN:2199-4536
Container-title:Complex & Intelligent Systems
language:en
Short-container-title:Complex Intell. Syst.

Author:

Mehta Vivek, Bawa Seema, Singh Jasmeet

Abstract

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, and so on. Text clustering is a fundamental data mining technique used for categorization, topic extraction, and information retrieval. Textual datasets, especially those containing a large number of documents, are sparse and high-dimensional. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN do not perform well on them. In this paper, a clustering technique especially suited to large text datasets is proposed that overcomes these limitations. The proposed technique, named WEClustering, is based on word embeddings derived from a recent deep learning model, “Bidirectional Encoder Representations from Transformers” (BERT). It deals effectively with the problem of high dimensionality, so more accurate clusters are formed. The technique is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed technique gives a significant improvement over the other techniques as measured by metrics such as Purity and Adjusted Rand Index.
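
The abstract names Purity and Adjusted Rand Index as evaluation metrics but does not define them. The following is a minimal sketch of the standard textbook definitions, not the authors' evaluation code; the function names and the toy labels are assumptions made for illustration only.

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    cells = Counter(zip(cluster_ids, true_labels))
    majority = {}
    for (cluster, _label), count in cells.items():
        majority[cluster] = max(majority.get(cluster, 0), count)
    return sum(majority.values()) / len(true_labels)


def adjusted_rand_index(true_labels, cluster_ids):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no document pairs to compare; treated as trivial agreement
    cells = Counter(zip(cluster_ids, true_labels))
    # Number of document pairs that share a cell, a cluster, or a class.
    pairs_cells = sum(comb(c, 2) for c in cells.values())
    pairs_clusters = sum(comb(c, 2) for c in Counter(cluster_ids).values())
    pairs_classes = sum(comb(c, 2) for c in Counter(true_labels).values())
    expected = pairs_clusters * pairs_classes / comb(n, 2)
    maximum = (pairs_clusters + pairs_classes) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single group
    return (pairs_cells - expected) / (maximum - expected)


# Hypothetical toy data: six documents, two true topics, two predicted clusters.
toy_true = ["sport", "sport", "sport", "tech", "tech", "tech"]
toy_pred = [0, 0, 1, 1, 1, 1]
toy_purity = purity(toy_true, toy_pred)
toy_ari = adjusted_rand_index(toy_true, toy_pred)
```

On the toy data, one "sport" document lands in the "tech" cluster, so purity is 5/6. The Adjusted Rand Index is lower because it also penalizes document pairs that are split or merged incorrectly and subtracts the agreement expected by chance.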

Publisher

Springer Science and Business Media LLC

Subject

General Earth and Planetary Sciences,General Environmental Science

Link

https://link.springer.com/content/pdf/10.1007/s40747-021-00512-9.pdf

References: 49 articles.

1. Novel coronavirus resource directory (2020) https://www.elsevier.com/novel-coronavirus-covid-19. Accessed 1 Oct 2020

2. Adhikari A, Ram A, Tang R, Lin J (2019) DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398

3. Aggarwal CC, Zhai C (2012) A survey of text clustering algorithms. Mining text data. Springer, New York, pp 77–128

4. Alammar J (2018) The illustrated BERT, ELMo, and co. http://jalammar.github.io/illustrated-bert/. Accessed 25 Jan 2021

5. Almeida F, Xexéo G (2019) Word embeddings: a survey. arXiv preprint arXiv:1901.09069

Cited by 16 articles.

1. Exploring associations between accident types and activities in construction using natural language processing;Automation in Construction;2024-08

2. Explainable Artificial Intelligence Methods to Enhance Transparency and Trust in Digital Deliberation Settings;Future Internet;2024-07-06

3. Using large language models to evaluate alternative uses task flexibility score;Thinking Skills and Creativity;2024-06

4. Density peaks clustering based on superior nodes and fuzzy correlation;Information Sciences;2024-06

5. A comprehensive and analytical review of text clustering techniques;International Journal of Data Science and Analytics;2024-04-08