Multi-Purpose Dataset of Webpages and Its Content Blocks: Design and Structure Validation

Published: 2021-04-07 Issue: 8 Volume: 11 Page: 3319
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Griazev Kiril, Ramanauskaitė Simona

Abstract

The need for automated data extraction is growing continuously as information is constantly added to the World Wide Web. Researchers are developing new data extraction methods to achieve better performance than existing methods. Comparing algorithms to evaluate their performance is vital when developing new solutions. Because data extraction approaches vary, different algorithms require different datasets to test their performance. Currently, most datasets focus on a specific data extraction approach and therefore generally lack data that may be useful for other extraction methods. This makes it difficult to compare the performance of algorithms that differ vastly in their approach. To counter this, we propose a dataset of web page content blocks that includes various data points. We also validate its design and structure by performing block labeling experiments. Web developers of varying experience levels labeled multiple websites presented to them. Their labeling results were stored in the newly proposed dataset structure. The experiment proved the need for the proposed data points and validated the suitability of the dataset structure for multi-purpose dataset design.

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/11/8/3319/pdf

References: 24 articles.

1. Web mining taxonomy; Griazev, 2018

2. Web data extraction, applications and techniques: A survey

3. A brief survey of web data extraction tools

4. A Comprehensive Survey on Web Content Extraction Algorithms and Techniques; Al-Ghuribi, 2013

5. Learning Web Content Extraction with DOM Features; Utiu, 2018

Cited by 2 articles.

1. Web Page Content Block Identification with Extended Block Properties; Applied Sciences; 2023-05-05

2. Autonomous schema markups based on intelligent computing for search engine optimization; PeerJ Computer Science; 2022-12-08