A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling

Published:2022-01 Issue:1 Volume:12 Page:1-18
ISSN:2155-6377
Container-title:International Journal of Information Retrieval Research
language:en

Author:

Kumaresan Umamageswari¹, Ramanujam Kalpana¹

Affiliation:

1. Pondicherry Engineering College, India

Abstract

This research aims to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, deduce the template used to generate those pages, and then extract the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, and then require manual labeling of the extracted data records. The technique discussed in this paper departs from state-of-the-art approaches by determining the informative sections of a web page through repetition of informative content rather than syntactic structure. In the experiments, the system identified the data-rich region with 100% precision for web sites belonging to different domains. Experiments conducted on real-world web sites demonstrate the effectiveness and versatility of the proposed approach.

Publisher

IGI Global

Subject

General Medicine

References (23 articles)

1. Extracting structured data from web pages; A. Arasu; Proc. ACM SIGMOD, 2003

2. Web harvesting: web data extraction techniques for deep web pages; U. Baskaran; Web usage mining techniques and applications across industries, 2017

3. Automated scraping of structured data records from health discussion forums using semantic analysis

4. Web Data Extraction System

5. Böhm, H. J., & Schneider, G. (2008). Virtual Screening for Bioactive Molecules. Retrieved from https://pubs.acs.org/doi/abs/10.1021/ja0152052

Cited by 1 article.

1. A Declarative Query Language Enabled Autonomous Deep Web Search Engine; Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing; 2024-04-08