The performance of BERT as data representation of text clustering

Published:2022-02-08 Issue:1 Volume:9 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Subakti Alvin, Murfi Hendri, Hariadi Nora

Abstract

Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to each other than to those in a different group. Grouping text manually requires a significant amount of time and labor. Therefore, automation using machine learning is necessary. One of the most frequently used methods to represent textual data is Term Frequency Inverse Document Frequency (TFIDF). However, TFIDF cannot consider the position and context of a word in a sentence. The Bidirectional Encoder Representations from Transformers (BERT) model can produce text representations that incorporate the position and context of a word in a sentence. This research analyzes the performance of the BERT model as a data representation for text. Moreover, various feature extraction and normalization methods are applied to the data representation of the BERT model. To examine the performance of BERT, we use four clustering algorithms, i.e., k-means clustering, eigenspace-based fuzzy c-means, deep embedded clustering, and improved deep embedded clustering. Our simulations show that BERT outperforms the TFIDF method in 28 out of 36 metrics. Furthermore, different feature extraction and normalization methods produce varied performance. The choice of feature extraction and normalization methods must be adapted to the text clustering algorithm used.
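
The abstract's point that TFIDF ignores word position can be illustrated with a small sketch. This is not the authors' code: the `tfidf_vectors` helper, the whitespace tokenizer, the smoothed IDF formula, and the sample sentences are all assumptions chosen for illustration. It shows that two sentences with the same words in a different order receive identical TFIDF vectors, even though their meanings differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents.

    Hypothetical helper for illustration; uses one common smoothed IDF variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency is the count normalized by document length.
        vectors.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vectors


docs = [
    "the dog bit the man",
    "the man bit the dog",   # same words as the first, different order
    "stock prices fell sharply",
]
vocab, vecs = tfidf_vectors(docs)

# Word order is lost: the first two sentences map to the same vector.
print(vecs[0] == vecs[1])
# A sentence with different words maps to a different vector.
print(vecs[0] == vecs[2])
```

A contextual model such as BERT would generally assign different representations to the first two sentences, which is the property the paper evaluates for clustering.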

Funder

Kementerian Riset Teknologi Dan Pendidikan Tinggi Republik Indonesia

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management, Computer Networks and Communications, Hardware and Architecture, Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-022-00564-9.pdf

Reference 30 articles.

1. Bishop CM. Pattern recognition. Mach Learn. 2006;128:9.

2. Aggarwal CC, Zhai C. A survey of text clustering algorithms. In: Mining text data. New York, London: Springer; 2012. p. 77–128.

3. Parlina A, Ramli K, Murfi H. Exposing emerging trends in smart sustainable city research using deep autoencoders-based fuzzy c-means. Sustainability. 2021;13(5):2876.

4. Xiong C, Hua Z, Lv K, Li X. An improved k-means text clustering algorithm by optimizing initial cluster centers. In: 2016 7th International Conference on Cloud Computing and Big Data (CCBD). New York: IEEE; 2016. p. 265–268.

5. Murfi H. The accuracy of fuzzy c-means in lower-dimensional space for topic detection. In: International Conference on Smart Computing and Communication. Berlin: Springer. 2018; p. 321–334.

Cited by 44 articles.

1. Evaluating Active Learning Strategies for Automated Classification of Patient Safety Event Reports in Hospitals;Proceedings of the Human Factors and Ergonomics Society Annual Meeting;2024-08-13

2. Deep Transfer Learning Hybrid Techniques for Precision in Breast Cancer Tumor Histopathology Classification;2024-07-12

3. Revolutionary text clustering: Investigating transfer learning capacity of SBERT models through pooling techniques;Engineering Science and Technology, an International Journal;2024-07

4. Revolutionizing NLP: Multimodal Integration for Enhanced Image-to-Text Extraction;2024 3rd International Conference on Computational Modelling, Simulation and Optimization (ICCMSO);2024-06-14

5. CVs Classification Using Neural Network Approaches Combined with BERT and Gensim: CVs of Moroccan Engineering Students;Data;2024-05-24