Automatic information extraction from large websites

Published:2004-09 Issue:5 Volume:51 Page:731-779
ISSN:0004-5411
Container-title:Journal of the ACM
language:en
Short-container-title:J. ACM

Author:

Crescenzi Valter¹,Mecca Giansalvatore²

Affiliation:

1. Università di Roma Tre

2. Università della Basilicata, Potenza, Italy

Abstract

Information extraction from websites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature.

We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks.

The main contributions of the article stand in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, differently from other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes.

A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known Websites, and discuss opportunities and limitations of the proposed approach.

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Hardware and Architecture, Information Systems, Control and Systems Engineering, Software

Link

https://dl.acm.org/doi/pdf/10.1145/1017460.1017462

Reference 40 articles.

1. Inductive inference of formal languages from positive data;Angluin D.;Inf. Cont.,1980

2. Inference of Reversible Languages

Cited by 85 articles.

1. Automated information collection for thematic interpretation of a Quranic term;Proceedings of the 4th International Computer Sciences and Informatics Conference (ICSIC 2022);2023

2. Automatic signboard detection and localization in densely populated developing cities;Signal Processing: Image Communication;2022-11

3. WebFormer: The Web-page Transformer for Structure Information Extraction;Proceedings of the ACM Web Conference 2022;2022-04-25

4. Convolutional Neural Networks Base Text Recognition;SSRN Electronic Journal;2022

5. The smallest extraction problem;Proceedings of the VLDB Endowment;2021-07