A brief survey of web data extraction tools

Published:2002-06 Issue:2 Volume:31 Page:84-93
ISSN:0163-5808
Container-title:ACM SIGMOD Record
language:en
Short-container-title:SIGMOD Rec.

Author:

Laender Alberto H. F.¹, Ribeiro-Neto Berthier A.¹, da Silva Altigran S.¹, Teixeira Juliana S.¹

Affiliation:

1. Federal University of Minas Gerais, Belo Horizonte MG Brazil

Abstract

In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. Hopefully, this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems,Software

Link

https://dl.acm.org/doi/pdf/10.1145/565117.565137

Reference 34 articles.

1. NoDoSE---a tool for semi-automatically extracting structured and semistructured data from text documents

Cited by 221 articles.

1. FAI: A Fraudulent Account Identification System;Artificial Intelligence;2024

2. Scientific and Technical Information Resources Semistructured Data Ontological Model;2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT);2023-10-19

3. Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers;International Journal of Computers and Applications;2023-05-04

4. Industrial safety management in the digital era: Constructing a knowledge graph from near misses;Computers in Industry;2023-04

5. Development of Language Models for Continuous Uzbek Speech Recognition System;Sensors;2023-01-19