Affiliation:
1. Communication and Information Technologies Department, University of A Coruña, Campus de Elviña, A Coruña, Spain
Abstract
The main goal of this study is to present a scale that classifies crawling
systems according to their effectiveness in traversing the "client-side"
Hidden Web. First, we perform a thorough analysis of the different
client-side technologies and the main features of the web pages in order to
determine the basic steps of the aforementioned scale. Then, we define the
scale by grouping basic scenarios in terms of several common features, and we
propose some methods to evaluate the effectiveness of the crawlers according
to the levels of the scale. Finally, we present a testing web site and we
show the results of applying the aforementioned methods to the results
obtained by some open-source and commercial crawlers that tried to traverse
the pages. Only a few crawlers achieve good results in handling client-side
technologies. Regarding standalone crawlers, we highlight the open-source
crawlers Heritrix and Nutch and the commercial crawler WebCopierPro, which is
able to process very complex scenarios. With regard to the crawlers of the
main search engines, only Google processes most of the scenarios we have
proposed, while Yahoo! and Bing only deal with the basic ones. Few studies
assess the capacity of crawlers to deal with client-side technologies, and
those that exist consider fewer technologies, fewer crawlers, and fewer
combinations than this study. Furthermore, to the best of our
knowledge, our article provides the first scale for classifying crawlers from
the point of view of the most important client-side technologies.
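To make the evaluation idea concrete, the following is a minimal, hypothetical sketch (not taken from the article) of how a crawler's effectiveness could be scored against such a scale: each level is represented by the set of test pages a crawler must retrieve to pass it, and the crawler's list of retrieved URLs is checked level by level. The level definitions, URLs, and the score_crawler function are illustrative assumptions, not the article's actual method.

```python
# Hypothetical illustration: scoring a crawler against a scale of
# client-side scenarios. Each level lists the test pages a crawler must
# have retrieved to pass that level. The levels, URLs and scoring rule
# are assumptions for illustration only.

SCALE = {
    1: {"/static-link.html"},            # plain HTML anchors
    2: {"/js-window-location.html"},     # link built with JavaScript
    3: {"/js-dom-generated-link.html"},  # link injected into the DOM
    4: {"/ajax-loaded-content.html"},    # content fetched via AJAX
}

def score_crawler(retrieved_urls):
    """Return the highest scale level whose pages were all retrieved."""
    retrieved = set(retrieved_urls)
    level_reached = 0
    for level in sorted(SCALE):
        if SCALE[level] <= retrieved:  # all pages of this level were crawled
            level_reached = level
        else:
            break                      # levels are cumulative; stop at first failure
    return level_reached

# Example: a crawler that handles static and simple JavaScript links,
# but not DOM-generated ones, reaches level 2.
print(score_crawler(["/static-link.html", "/js-window-location.html"]))  # -> 2
```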
Publisher
National Library of Serbia
Cited by
3 articles.