Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment

Published: 2020-08-26 Issue: 9 Volume: 12 Page: 144
ISSN: 1999-5903
Container-title: Future Internet
Language: en
Short-container-title: Future Internet

Author:

Bodrunova Svetlana S., Orekhov Andrey V., Blekanov Ivan S., Lyudkevich Nikolay S., Tarasov Nikita A.

Abstract

The paper addresses optimal text classification in the automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the researcher must set the number of clusters in advance, and both the optimal number of clusters and the quality of the model that measures the proximity of texts to each other remain unresolved questions. We propose a novel approach to automatically determining the optimal number of clusters that also incorporates an assessment of word proximity of texts, combined with a text encoding model based on sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward’s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We run an experiment on two datasets of real-world labeled data: News20 and BBC. The proposed model is tested against more traditional text representation methods, such as bag-of-words and word2vec, to show that it yields much better quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop when the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.
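The general shape of the pipeline in the abstract (embed texts, run Ward agglomerative clustering, stop merging at an automatically chosen point) can be illustrated with a minimal sketch. Everything specific here is an assumption: synthetic vectors stand in for Universal Sentence Encoder embeddings, and a simple largest-gap rule on the merge heights stands in for the stopping criterion. It is not the paper's Markov stopping moment or its “e-2” hypothesis, whose details are not given in this record. The function name `cluster_with_largest_gap` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_with_largest_gap(embeddings):
    """Ward clustering; cut the dendrogram at the largest jump in merge height.

    NOTE: the largest-gap rule is an illustrative stand-in, not the
    Markov stopping moment described in the paper.
    """
    n = embeddings.shape[0]
    if n < 3:
        raise ValueError("need at least 3 points to compare merge heights")
    # Each row of Z describes one merge; column 2 is the Ward merge height.
    Z = linkage(embeddings, method="ward")
    heights = Z[:, 2]
    # Largest increase between consecutive merges: stop before the later one.
    i = int(np.argmax(np.diff(heights)))
    k = n - (i + 1)  # clusters remaining after merge number i (0-indexed)
    labels = fcluster(Z, t=k, criterion="maxclust")
    return labels, k


# Synthetic stand-in for sentence embeddings: three tight groups in 8 dimensions.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(8)[:3]
vectors = np.vstack([c + 0.1 * rng.standard_normal((20, 8)) for c in centers])

labels, k = cluster_with_largest_gap(vectors)
print("chosen number of clusters:", k)
```

In a real setting the `vectors` array would be replaced by the sentence embeddings of the documents, and the cut rule by whichever stopping criterion is being evaluated.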

Funder

Russian Science Foundation

Publisher

MDPI AG

Subject

Computer Networks and Communications

Link

https://www.mdpi.com/1999-5903/12/9/144/pdf

References (34 articles)

1. Topic modelling for qualitative studies

2. Topic modelling in Russia: Current approaches and issues in methodology; Bodrunova

3. Latent Dirichlet allocation; Blei; J. Mach. Learn. Res., 2003

4. A comparative evaluation of pre-processing techniques and their interactions for Twitter sentiment analysis

Cited by 19 articles.

1. Research on Optimal Design of Civil Sensors Based on Agglomerative Hierarchical Clustering Algorithm; Tehnicki vjesnik - Technical Gazette; 2024-10-15

2. Pretrained Language Models for Semantics-Aware Data Harmonisation of Observational Clinical Studies in the Era of Big Data; 2024-09-02

3. Pretrained Language Models for Semantics-Aware Data Harmonisation of Observational Clinical Studies in the Era of Big Data; 2024-07-12

4. “Dirclustering”: a semantic clustering approach to optimize website structure discovery during penetration testing; Journal of Computer Virology and Hacking Techniques; 2024-02-07

5. Financial Text Categorisation with FinBERT on Key Audit Matters; 2023 IEEE Symposium on Computers & Informatics (ISCI); 2023-10-14