Measurement of clustering effectiveness for document collections

Published:2022-01-10 Issue:3 Volume:25 Page:239-268
ISSN:1386-4564
Container-title:Information Retrieval Journal
language:en
Short-container-title:Inf Retrieval J

Author:

Yuan Meng, Zobel Justin, Lin Pauline

Abstract

Clustering of the contents of a document corpus is used to create sub-corpora with the intention that they are expected to consist of documents that are related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data it is possible that clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, to measure clustering effectiveness. Results with our new, extrinsic techniques based on relevance judgements or retrieved documents demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but as the results show our measurement techniques can effectively distinguish between clustering methods.

Funder

University of Melbourne

Publisher

Springer Science and Business Media LLC

Subject

Library and Information Sciences, Information Systems

Link

https://link.springer.com/content/pdf/10.1007/s10791-021-09401-8.pdf

Cited by 6 articles.

1. CLUSTERIZAÇÃO DE PROCESSOS JUDICIAIS COM ASSUNTOS SIMILARES;REVISTA FOCO;2024-03-28

2. Detecting Topics and Polarity From Twitter: A University Faculty Case;IEEE Access;2024

3. Academic information retrieval using citation clusters: in-depth evaluation based on systematic reviews;Scientometrics;2023-03-21

4. Arabic Document Clustering: A Survey;2022 4th International Conference on Current Research in Engineering and Science Applications (ICCRESA);2022-12-20

5. Generation of High-Quality Relevant Judgments through Document Similarity and Document Pooling for the Evaluation of Information Retrieval Systems;2022 14th International Conference on Software, Knowledge, Information Management and Applications (SKIMA);2022-12-02