Affiliation:
1. UNAM, Mexico
2. Nanyang Technological University, Singapore
Abstract
With the large amount of information available on the WWW, the ability to distinguish relevant from irrelevant data becomes crucial. In this project, eight web scraping spiders were configured and evaluated to determine their suitability for competitive intelligence gathering by Interactive Digital Media (IDM) start-ups. These spiders were selected because they are readily available on the internet at low cost. Each spider was configured and tested on two web sites. The evaluation was first carried out individually to score the spiders and then as a team to moderate the scores. Web Info Extractor has the highest overall score as a web scraping spider, while Web Content Extractor has the best task analysis result. The evaluation shows that different spiders have varying capabilities and are therefore suited to different tasks. A spider that can handle more complex tasks is usually more complex to configure and less user-friendly. Hence, to select the right spider, companies should understand the tasks undertaken by their customers through basic task analysis and should know what resources they have at their disposal for configuring and operating the spiders.