
A Novel Approach to Data Extraction on Hyperlinked Webpages

Published:2019-11-25 Issue:23 Volume:9 Page:5102
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Shaukat, Masood, Khushi

Abstract

The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail about certain attribute values. Extracting the schema of such relational tables is a challenge due to the absence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages from various websites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables; for linked tables, a relational schema in the form of primary and foreign keys (PK and FK) was developed. Each child table was concatenated with the parent table’s attribute value (the PK), which serves as its foreign key (FK). As a result, these tables can support richer queries using the join operation. Manual checking of the linked web table results revealed 99% precision and 68% recall. Our downloadable corpus of 15,000 pages and our novel algorithm will provide the basis for further research in this field.
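The PK/FK linking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`attach_foreign_key`, `join`), the `parent_id` column name, and the sample data are all hypothetical, and the sketch assumes the tables have already been extracted and that each child table is known to be linked from one parent cell value.

```python
# Hypothetical illustration only: names and data are invented, not taken from the paper.

def attach_foreign_key(child_rows, parent_key_value, fk_name="parent_id"):
    """Return copies of the child rows with the parent's PK value added as an FK column."""
    return [{fk_name: parent_key_value, **row} for row in child_rows]


def join(parent_rows, child_rows, pk_name, fk_name="parent_id"):
    """Inner join: pair each child row with the parent row whose PK equals the child's FK."""
    by_pk = {row[pk_name]: row for row in parent_rows}
    joined = []
    for child in child_rows:
        parent = by_pk.get(child[fk_name])
        if parent is not None:
            merged = dict(parent)
            # Copy every child column except the FK, which duplicates the parent's PK.
            merged.update({k: v for k, v in child.items() if k != fk_name})
            joined.append(merged)
    return joined


# Parent table: each "team" cell is assumed to hyperlink to a child table.
parent = [{"team": "A", "city": "Oslo"}, {"team": "B", "city": "Bergen"}]

# Child tables, keyed by the parent cell value that links to them.
linked = {
    "A": [{"player": "Ann", "goals": 3}],
    "B": [{"player": "Bo", "goals": 1}, {"player": "Cy", "goals": 2}],
}

children = []
for key, rows in linked.items():
    children.extend(attach_foreign_key(rows, key))

result = join(parent, children, pk_name="team")
```

Once every child row carries its parent's key, a query such as "all players with their team's city" becomes a single join over the two tables.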

Funder

Universitetet i Stavanger

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/9/23/5102/pdf

Reference: 39 articles.

1. Annotating and searching web tables using entities, types and relationships

2. A survey of table recognition

3. WebTables

Cited by 8 articles.

1. Patents and Public Health: State Responsibility to Opt for a Balanced Approach;Societies;2024-08-13

2. A sentiment analysis method for COVID-19 network comments integrated with semantic concept;Engineering Applications of Artificial Intelligence;2024-02

3. Rumor identification and diffusion impact analysis in real-time text stream using deep learning;The Journal of Supercomputing;2023-11-13

4. Student Cheating Detection in Higher Education by Implementing Machine Learning and LSTM Techniques;Sensors;2023-04-20

5. Predicting Early Withdrawal of University Students: A Comparative Study between KNN and Decision Tree;2023 4th International Conference on Advancements in Computational Sciences (ICACS);2023-02-20