A focused crawler based on semantic disambiguation vector space model
Published:2022-07-05
Issue:1
Volume:9
Page:345-366
ISSN:2199-4536
Container-title:Complex & Intelligent Systems
language:en
Short-container-title:Complex Intell. Syst.
Author:
Liu Wenjun, He Yu, Wu Jing, Du Yajun, Liu Xing, Xi Tiejun, Gan Zurui, Jiang Pengjun, Huang Xiaoping
Abstract
A focused crawler continuously fetches web pages related to a given topic, ordering its visits by the priorities of unvisited hyperlinks. In many previous studies, focused crawlers predict these priorities using text similarity models. However, the terms used to represent a web page ignore polysemy, and the topic similarity of a text does not combine cosine similarity and semantic similarity effectively. To address these problems, this paper proposes a focused crawler based on a semantic disambiguation vector space model (SDVSM). The SDVSM method combines a semantic disambiguation graph (SDG) and a semantic vector space model (SVSM). The SDG removes ambiguous terms that are irrelevant to the given topic from the representation terms of retrieved web pages. The SVSM calculates the topic similarity of a text by constructing text and topic semantic vectors from the TF × IDF weights of terms and the semantic similarities between terms. Experimental results comparing four focused crawlers across several evaluation indicators indicate that the SDVSM method can improve focused-crawler performance. In conclusion, the proposed method enables the focused crawler to retrieve more, and higher-quality, web pages related to the given topic from the Internet.
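The abstract does not give the SVSM formula, so the following is only a rough sketch of the general idea of mixing TF × IDF weights with term-to-term semantic similarity. It uses the soft cosine measure as a stand-in, not the paper's own definition. The names `tf_idf`, `semantic_similarity`, `soft_cosine`, and the hand-written `sim_table` are all hypothetical, and the similarity values are illustrative rather than taken from the paper.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: TF x IDF weight} for one tokenized document.

    Assumption: relative term frequency and a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. The paper may weight terms differently.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def semantic_similarity(a, b, sim_table):
    """Look up a symmetric similarity in [0, 1]; identical terms score 1."""
    if a == b:
        return 1.0
    return sim_table.get((a, b), sim_table.get((b, a), 0.0))


def soft_cosine(u, v, sim_table):
    """Soft cosine between two weighted term vectors.

    With an empty sim_table this reduces to the ordinary cosine similarity.
    Assumes non-negative weights and similarities.
    """
    def dot(x, y):
        return sum(
            wx * wy * semantic_similarity(tx, ty, sim_table)
            for tx, wx in x.items()
            for ty, wy in y.items()
        )

    denom = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / denom if denom > 0 else 0.0


# Illustrative data only: a tiny corpus and a made-up similarity table.
corpus = [["focused", "crawler", "web"], ["spider", "web", "page"]]
sim_table = {("crawler", "spider"): 0.8}
topic_vec = tf_idf(["focused", "crawler"], corpus)
page_vec = tf_idf(["spider", "web", "page"], corpus)
print(soft_cosine(topic_vec, page_vec, sim_table))
```

In this sketch the page and the topic share no literal term, so a plain cosine would score them at zero, while the semantic term lets the related pair "crawler" and "spider" contribute to the score.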
Funder
National Natural Science Foundation of China
Science and Technology Department of Sichuan Province
Education and Teaching Reform Research Project of Xihua University
College Student Innovation and Entrepreneurship Training Project of Sichuan Province
Publisher
Springer Science and Business Media LLC
Subject
Computational Mathematics, Engineering (miscellaneous), Information Systems, Artificial Intelligence