Improving Classification of Documents by Semi-supervised Clustering in a Semantic Space

Published: 2023
Pages: 121-129
ISSN: 1431-8814
Container-title: Studies in Classification, Data Analysis, and Knowledge Organization

Author:

Dobša Jasminka, Kiers Henk A. L.

Abstract

In this paper we propose a method for representing documents in a lower-dimensional semantic space, based on a modified Reduced k-means method that penalizes clusterings that are distant from the classification of training documents given by experts. Reduced k-means (RKM) enables simultaneous clustering of documents and extraction of factors. By projecting documents represented in the vector space model onto the extracted factors, documents are clustered in the semantic space in a semi-supervised way (using penalization), because the clustering is guided by the classification given by experts; this enables improved classification performance on test documents. Classification performance is tested using logistic regression and support vector machines (SVMs) on classes of the Reuters-21578 data set. Representation of documents by the RKM method with penalization improves the average precision of classification by SVMs for the 25 largest classes of the Reuters collection by about 5.5%, at the same level of average recall, compared with the basic representation in the vector space model. For classification by logistic regression, representation by RKM with penalization improves average recall by about 1% compared with the basic representation.
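To make the idea of simultaneous clustering and factor extraction concrete, the following is a minimal sketch of plain Reduced k-means fitted by alternating least squares, written with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `reduced_kmeans` and `project` and all parameters are hypothetical, the data are mean-centered as a preprocessing choice, and the expert-guided penalty term is deliberately omitted because the abstract does not give its form.

```python
import numpy as np

def reduced_kmeans(X, n_clusters, n_factors, n_iter=50, seed=0):
    """Plain Reduced k-means by alternating least squares.

    Hypothetical sketch: the penalty that ties clusters to expert labels
    is NOT included, since the abstract does not specify it.
    """
    if n_factors > min(n_clusters, X.shape[1]):
        raise ValueError("n_factors must not exceed min(n_clusters, n_features)")
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # assumption: center the data
    labels = rng.integers(n_clusters, size=Xc.shape[0])
    labels[:n_clusters] = np.arange(n_clusters)     # every cluster starts non-empty
    for _ in range(n_iter):
        U = np.eye(n_clusters)[labels]              # n x k membership matrix
        counts = U.sum(axis=0)
        centroids = (U.T @ Xc) / np.maximum(counts, 1)[:, None]
        # Loadings A: leading right singular vectors of the size-weighted
        # centroids, i.e. leading eigenvectors of X' U (U'U)^-1 U' X.
        weighted = np.sqrt(counts)[:, None] * centroids
        _, _, Vt = np.linalg.svd(weighted, full_matrices=False)
        A = Vt[:n_factors].T                        # features x factors, A'A = I
        Y = Xc @ A                                  # documents in the reduced space
        F = centroids @ A                           # centroids in the reduced space
        # Reassign each document to its nearest centroid in the reduced space.
        dist = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, A, F, mean

def project(X_new, mean, A):
    """Project new (e.g. test) documents onto the extracted factors."""
    return (X_new - mean) @ A
```

In a pipeline of the kind the abstract describes, the rows returned by `project` would serve as features for a downstream classifier such as logistic regression or an SVM; that step is left out here to keep the sketch self-contained.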

Publisher

Springer International Publishing

Link

https://link.springer.com/content/pdf/10.1007/978-3-031-09034-9_14

References: 13 articles.

1. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. Journal of Machine Learning Research 3, 1137-1155 (2003)

2. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R. A.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391-407 (1990)

3. DeSarbo, W. S., Jedidi, K., Cool, K., Schendel, D.: Simultaneous multidimensional unfolding and cluster analysis: an investigation of strategic groups. Marketing Letters 2, 129-146 (1990)

4. De Soete, G., Carroll, J. D.: K-means clustering in a low-dimensional Euclidean space. In: Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., Burtschy, B. (eds.) New Approaches in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 212–219. Springer, Heidelberg (1994)

5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186. Association for Computational Linguistics (2019)