Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints

Published: 2020-06-06 Issue: 6 Volume: 12 Page: 967
ISSN: 2073-8994
Container-title: Symmetry
Language: en
Short-container-title: Symmetry

Author:

Buatoom Uraiwan, Kongprawechnon Waree, Theeramunkong Thanaruk

Abstract

In similarity-based constrained clustering, various approaches have been proposed for defining the similarity between documents so that similar documents are grouped together. This paper presents an approach that uses term-distribution statistics, extracted from a small number of cue instances with known classes, as term weightings that act as an indirect distance constraint. For distribution-based term weighting, three types of term-oriented standard deviations are exploited: the distribution of a term in a collection (SD), the average distribution of a term in a class (ACSD), and the average distribution of a term among classes (CSD). These term weightings are explored under a symmetry concept, by varying the magnitude of each of the three standard deviations between positive and negative values to produce promoting and demoting effects. Following the same symmetry concept, both seeded and unseeded centroid initializations of k-means are investigated and compared with centroid-based classification. The experiments use five English text collections (Amazon, DI, WebKB1, WebKB2, and 20Newsgroup) and one Thai text collection (TR, a collection of Thai reform-related opinions). Compared to conventional TFIDF, distribution-based term weighting improves the centroid-based method, seeded k-means, and k-means, with error reduction rates of 22.45%, 31.13%, and 58.96%, respectively.
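
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions, because the abstract does not give the formulas: SD is taken as the standard deviation of a term's frequency over all cue documents, ACSD as the mean of its per-class standard deviations, CSD as the standard deviation of its per-class mean frequencies, and the weight as a product of the three raised to signed exponents `a`, `b`, `c` (positive promotes, negative demotes). The function names, the smoothing constant `eps`, the cosine similarity, and the toy data are all hypothetical.

```python
import math
from statistics import mean, pstdev


def term_stats(docs, labels):
    """Return {term: (SD, ACSD, CSD)} from labeled cue documents.

    docs: list of dicts mapping term -> frequency; labels: one class per doc.
    The three definitions are assumptions based on the abstract's wording.
    """
    vocab = sorted({t for d in docs for t in d})
    classes = sorted(set(labels))
    stats = {}
    for t in vocab:
        freqs = [float(d.get(t, 0)) for d in docs]
        per_class = [[f for f, y in zip(freqs, labels) if y == c] for c in classes]
        sd = pstdev(freqs)                             # spread over the collection
        acsd = mean([pstdev(v) for v in per_class])    # average within-class spread
        csd = pstdev([mean(v) for v in per_class])     # spread of class means
        stats[t] = (sd, acsd, csd)
    return stats


def term_weight(stat, a, b, c, eps=1e-6):
    """Assumed weighting: SD^a * ACSD^b * CSD^c, smoothed by eps to avoid 0**negative."""
    sd, acsd, csd = stat
    return (sd + eps) ** a * (acsd + eps) ** b * (csd + eps) ** c


def vectorize(doc, vocab, weights):
    """Weighted term-frequency vector over a fixed vocabulary."""
    return [doc.get(t, 0) * weights[t] for t in vocab]


def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (nu * nv) if nu and nv else 0.0


def seeded_kmeans(vectors, seeds, iters=10):
    """k-means whose initial centroids are the given seeds (cosine similarity)."""
    centroids = [list(s) for s in seeds]
    assign = []
    for _ in range(iters):
        assign = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [mean(col) for col in zip(*members)]
    return assign


# Toy cue instances with known classes (hypothetical data).
cue_docs = [{"ball": 3, "said": 1}, {"ball": 3, "said": 3},
            {"vote": 4, "said": 3}, {"vote": 4, "said": 1}]
cue_labels = ["sport", "sport", "politics", "politics"]

stats = term_stats(cue_docs, cue_labels)
vocab = sorted(stats)
# Demote within-class spread (b = -1), promote between-class spread (c = +1).
weights = {t: term_weight(stats[t], a=0, b=-1, c=1) for t in vocab}

# Seed one centroid per class from the weighted cue documents.
classes = sorted(set(cue_labels))
seeds = [[mean(col) for col in zip(*[vectorize(d, vocab, weights)
                                     for d, y in zip(cue_docs, cue_labels) if y == c])]
         for c in classes]

unlabeled = [{"ball": 2, "said": 5}, {"vote": 1, "said": 4}, {"ball": 1}]
clusters = seeded_kmeans([vectorize(d, vocab, weights) for d in unlabeled], seeds)
print([classes[k] for k in clusters])
```

In this toy setup the term "said" occurs equally often in both classes, so its CSD is zero and its weight is close to zero, while "ball" and "vote" dominate the similarity. An unseeded variant would differ only in how `seeds` is chosen, for example by sampling initial centroids from the unlabeled vectors.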

Funder

Thailand Research Fund

Thammasat University

Burapha University

National Science and Technology Development Agency

Publisher

MDPI AG

Subject

Physics and Astronomy (miscellaneous), General Mathematics, Chemistry (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2073-8994/12/6/967/pdf

Cited by 5 articles.

1. Improving Classification Performance with Statistically Weighted Dimensions and Dimensionality Reduction;Applied Sciences;2023-02-03

2. Document vector extension for document classification;INTERNATIONAL CONFERENCE ON SCIENCE, ENGINEERING, AND TECHNOLOGY 2022: Conference Proceedings;2023

3. Feature Discrimination of News Based on Canopy and KMGC-Search Clustering;IEEE Access;2022

4. Opinion Mining in Sociopolitical Research;Opportunities and Challenges for Computational Social Science Methods;2022

5. Analysis of big data job requirements based on K-means text clustering in China;PLOS ONE;2021-08-05