An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach

Author:

Shahzad Asim (1), Nawi Nazri Mohd (2), Rehman Muhammad Zubair (3), Khan Abdullah (4)

Affiliation:

1. Faculty of Computer Science, Abbottabad University of Science and Technology, KPK, Abbottabad, Pakistan

2. Soft Computing & Data Mining Centre (SMC), Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor 86400, Malaysia

3. Faculty of Computing and IT, Sohar University, Sohar 311, Oman

4. Institute of Computer Sciences and Information Technology, Faculty of Management and Computer Sciences, University of Agriculture, Peshawar, Pakistan

Abstract

People use the web to share information and to deliver services and products. Information seekers use search engines (SEs) such as Google, Bing, and Yahoo to search for products, services, and information. However, web spamming is one of the most significant issues SEs face because it dramatically degrades the quality of SE results. Its economic impact is enormous because web spammers get massive amounts of free advertising data indexed on SEs to increase web traffic to a targeted website. Using different web-spamming techniques, spammers trick an SE into ranking irrelevant web pages higher than relevant ones in the search engine results pages (SERPs). Consequently, these highly ranked but unrelated web pages offer the user insufficient or inappropriate information. Several researchers from industry and academia are working on detecting spam web pages, but no efficient technique capable of catching all spam web pages on the World Wide Web (WWW) has been presented yet. This research proposes an improved framework for content- and link-based web-spam identification. The framework uses stopwords, keyword frequency, part-of-speech (POS) ratio, a spam-keywords database, and copied-content algorithms for content-based web-spam detection. For link-based web-spam detection, we first exposed the relationship network behind link-based web spamming and then used the paid-link database, neighbour pages, spam signals, and link-farm algorithms. Finally, we combined all the content- and link-based spam identification algorithms to identify both types of spam. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used to conduct experiments and to obtain threshold values. A promising F-measure of 79.6% with 81.2% precision shows the applicability and effectiveness of the proposed approach.
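As a rough illustration of two of the content-based signals named in the abstract (stopwords and keyword frequency), the following minimal Python sketch may help. It is not the authors' implementation: the stopword list, the function names, and both thresholds are hypothetical placeholders (the paper derives its thresholds from the WEBSPAM-UK datasets), and combining the content and link verdicts with a logical OR is only an assumption about how the two detectors might be merged.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; the paper's list is not given here.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stopword_ratio(tokens):
    """Fraction of tokens that are stopwords; natural prose tends to score higher."""
    if not tokens:
        return 0.0
    return sum(t in STOPWORDS for t in tokens) / len(tokens)


def top_keyword_share(tokens):
    """Share of non-stopword tokens taken up by the single most frequent keyword."""
    content = [t for t in tokens if t not in STOPWORDS]
    if not content:
        return 0.0
    _, count = Counter(content).most_common(1)[0]
    return count / len(content)


def looks_like_content_spam(text, min_stopword_ratio=0.15, max_keyword_share=0.30):
    """Flag a page whose text has too few stopwords or one dominant keyword.

    Both thresholds are assumed values chosen for illustration only.
    """
    tokens = tokenize(text)
    if not tokens:
        return False  # nothing to judge
    return (stopword_ratio(tokens) < min_stopword_ratio
            or top_keyword_share(tokens) > max_keyword_share)


def combined_verdict(content_flag, link_flag):
    """Assumed combination rule: a page is spam if either detector flags it."""
    return content_flag or link_flag
```

For example, `looks_like_content_spam("cheap pills cheap pills cheap pills buy cheap pills")` flags the text because it contains no stopwords at all, whereas an ordinary English sentence with a typical share of function words and no dominant keyword is not flagged.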

Funder

Universiti Tun Hussein Onn Malaysia

Publisher

Hindawi Limited

Subject

Multidisciplinary, General Computer Science


