Design and analyses of web scraping on burstable virtual machines
Published:2023-12-27
ISSN:1532-0626
Container-title:Concurrency and Computation: Practice and Experience
language:en
Short-container-title:Concurrency and Computation
Author:
Drummond Lúcia Maria A. (1),
Andrade Luciano (1),
Muniz Pedro de Brito (1),
Pereira Matheus Marotti (1),
Silva Thiago do Prado (1),
Teylo Luan (1, 2)
Affiliation:
1. Instituto de Computação Universidade Federal Fluminense (UFF) Niterói Brazil
2. INRIA Bordeaux France
Abstract
Web scraping is a widely used technique for collecting and structuring public data from the internet to support decision-making. As the volume of data continues to grow, more efficient methods of data extraction become crucial. This article introduces a novel web scraping framework that uses Burstable virtual machines (VMs) on Amazon Web Services to reduce the monetary cost of execution while complying with service level agreements (SLAs). To achieve this, the framework combines fixed and temporary Burstable VMs in a mixed cluster, which can be elastically scaled up to fulfill the SLA and scaled down to minimize monetary costs. Two strategies for handling VM allocation are proposed and evaluated: (i) a queue and SLA-based strategy that uses queue size information and SLA criteria to determine the number of VMs required for the current scraping requests, and (ii) a credit-based strategy that incorporates information about Burstable VM credits to manage instance creation and termination. Experimental tests show that the proposed framework meets the defined SLA while achieving cost reductions of up to 74% compared to an approach that executes on fixed-size clusters of Burstable instances.
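The two allocation strategies summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ClusterState`, `vms_needed`, `scaling_delta`, and `pick_vm_to_terminate` are hypothetical, and the model assumes that each VM processes one scraping request at a time at a constant rate and that the SLA is a single deadline for draining the queue. The credit rule (release the temporary VM with the fewest CPU credits) is likewise an assumption chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    queued_requests: int  # scraping requests waiting in the queue
    fixed_vms: int        # always-on burstable VMs
    temporary_vms: int    # burstable VMs added on demand


def vms_needed(queued_requests: int, seconds_per_request: float,
               sla_seconds: float) -> int:
    """Smallest VM count that drains the queue within the SLA deadline.

    Assumption: each VM handles one request at a time at a constant rate.
    """
    if queued_requests <= 0:
        return 0
    # Requests a single VM can complete before the deadline.
    per_vm = int(sla_seconds // seconds_per_request)
    if per_vm < 1:
        raise ValueError("a single request already exceeds the SLA deadline")
    return math.ceil(queued_requests / per_vm)


def scaling_delta(state: ClusterState, seconds_per_request: float,
                  sla_seconds: float) -> int:
    """Positive: temporary VMs to launch. Negative: temporary VMs to terminate."""
    required = vms_needed(state.queued_requests, seconds_per_request, sla_seconds)
    # The fixed VMs are never released, so they set the lower bound.
    target = max(state.fixed_vms, required)
    return target - (state.fixed_vms + state.temporary_vms)


def pick_vm_to_terminate(credit_balances: dict[str, float]) -> str:
    """Hypothetical credit-based rule: release the temporary VM with the
    fewest CPU credits left, since it is the closest to baseline throttling."""
    return min(credit_balances, key=credit_balances.get)
```

For example, with 100 queued requests, 30 seconds per request, and a 600-second deadline, each VM can complete 20 requests in time, so `vms_needed(100, 30, 600)` returns 5. A cluster of 2 fixed and 1 temporary VM would then need 2 more temporary VMs.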
Funder
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro
Conselho Nacional de Desenvolvimento Científico e Tecnológico
Subject
Computational Theory and Mathematics, Computer Networks and Communications, Computer Science Applications, Theoretical Computer Science, Software