A new approach to web users clustering and validation: a divergence‐based scheme

Authors

Koutsonikola Vassiliki A., Petridou Sophia G., Vakali Athena I., Papadimitriou Georgios I.

Abstract

Purpose: Web users' clustering is an important mining task since it contributes to identifying usage patterns, which benefits a wide range of applications that rely on the web. The purpose of this paper is to examine the use of Kullback‐Leibler (KL) divergence, an information‐theoretic distance, as an alternative option for measuring distances in web users clustering.

Design/methodology/approach: KL‐divergence is compared with other well‐known distance measures, and clustering results are evaluated using a criterion function, validity indices, and graphical representations. Furthermore, the impact of noise (i.e. occasional or mistaken page visits) is evaluated, since it is imperative to assess whether a clustering process tolerates noisy environments such as the web.

Findings: The proposed KL clustering approach performs similarly to other distance measures under both synthetic and real data workloads. Moreover, when extra noise is imposed on real data, the approach deteriorates less than most of the other conventional distance measures.

Practical implications: The experimental results show that a probabilistic measure such as KL‐divergence is quite efficient in noisy environments and thus constitutes a good alternative for the web users clustering problem.

Originality/value: This work is inspired by the use of divergence in the clustering of biological data, and the authors introduce it to the area of web clustering. According to the experimental results presented in this paper, KL‐divergence can be considered a good alternative for measuring distances in noisy environments such as the web.
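To make the idea of KL‐divergence as a distance between web users more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each user is represented by raw visit counts over the same set of pages, and that add‑one smoothing is used so that no page has zero probability; the function names, the smoothing choice, and the visit counts are all hypothetical and are not taken from the paper.

```python
import math


def visit_distribution(counts, pseudo_count=1.0):
    """Turn raw page-visit counts into a probability distribution.

    pseudo_count is an assumed smoothing term: KL-divergence is undefined
    when a page has probability zero in the second distribution.
    """
    smoothed = [c + pseudo_count for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), measured in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical visit counts over four pages for three users.
user_a = visit_distribution([10, 5, 0, 1])
user_b = visit_distribution([9, 6, 1, 0])   # browses much like user_a
user_c = visit_distribution([0, 1, 12, 8])  # favours different pages

d_ab = kl_divergence(user_a, user_b)
d_ac = kl_divergence(user_a, user_c)
print(f"D(a||b) = {d_ab:.4f}, D(a||c) = {d_ac:.4f}")
```

Under these assumptions, users with similar browsing habits yield a small divergence and users with different habits a larger one. Note that KL‐divergence is not symmetric, so D(a||b) and D(b||a) generally differ; a clustering scheme built on it has to account for that, for instance by comparing each user against a cluster representative in a fixed direction.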

Publisher

Emerald

Subject

Computer Networks and Communications, Information Systems

Cited by 3 articles.