Adapting Web information extraction knowledge via mining site-invariant and site-dependent features

Published:2007-02 Issue:1 Volume:7 Page:6
ISSN:1533-5399
Container-title:ACM Transactions on Internet Technology
language:en
Short-container-title:ACM Trans. Internet Technol.

Author:

Wong Tak-Lam¹,Lam Wai²

Affiliation:

1. City University of Hong Kong, Kowloon, Hong Kong

2. The Chinese University of Hong Kong, Shatin, Hong Kong

Abstract

We develop a novel framework that aims at automatically adapting previously learned information extraction knowledge from a source Web site to a new unseen target site in the same domain. Two kinds of features related to the text fragments from the Web documents are investigated. The first type of feature is called a site-invariant feature. These features likely remain unchanged in Web pages from different sites in the same domain. The second type of feature is called a site-dependent feature. These features are different in the Web pages collected from different Web sites, while they are similar in the Web pages originating from the same site. In our framework, we derive the site-invariant features from previously learned extraction knowledge and the items previously collected or extracted from the source Web site. The derived site-invariant features will be exploited to automatically seek a new set of training examples in the new unseen target site. Both the site-dependent features and the site-invariant features of these automatically discovered training examples will be considered in the learning of new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0%, respectively, without any further manual intervention.
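
The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes, purely for illustration, that the site-invariant feature is token overlap with items already collected from the source site, and that the site-dependent feature is a single layout label (a hypothetical `context` string such as a CSS-like path) attached to each text fragment. The names `invariant_score`, `adapt`, and the sample data are all invented for this example.

```python
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens of a text fragment."""
    return set(text.lower().split())


def invariant_score(fragment_text, source_items):
    """Best Jaccard token overlap between a fragment and any source-site item.

    Stands in for a site-invariant feature (assumption for this sketch).
    """
    frag = tokens(fragment_text)
    best = 0.0
    for item in source_items:
        item_tokens = tokens(item)
        union = frag | item_tokens
        if union:
            best = max(best, len(frag & item_tokens) / len(union))
    return best


def adapt(source_items, target_fragments, threshold=0.5):
    """Extract items from a target site without manual labels.

    target_fragments is a list of (text, context) pairs, where context is a
    hypothetical site-dependent layout label.
    """
    # Step 1: use the site-invariant score to discover training examples.
    seeds = [(text, ctx) for text, ctx in target_fragments
             if invariant_score(text, source_items) >= threshold]
    if not seeds:
        return []
    # Step 2: learn the site-dependent feature shared by those examples.
    learned_context = Counter(ctx for _, ctx in seeds).most_common(1)[0][0]
    # Step 3: extract every fragment that shares the learned context.
    return [text for text, ctx in target_fragments if ctx == learned_context]


# Hypothetical data: titles known from a source book catalog, and fragments
# from an unseen target catalog page.
source_items = ["Introduction to Algorithms", "The Art of Computer Programming"]
target_fragments = [
    ("Introduction to Algorithms", "div.title"),
    ("Structure and Interpretation of Computer Programs", "div.title"),
    ("Add to cart", "a.button"),
    ("The Art of Computer Programming", "div.title"),
    ("$59.99", "span.price"),
]
extracted = adapt(source_items, target_fragments)
```

In this sketch, the two titles that also appear in the source data act as automatically discovered training examples, and the layout label they share is then used to pick up a title the source site never contained.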

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/1189740.1189746

Reference: 42 articles.

1. Adaptive duplicate detection using learnable string similarity measures

2. 10.1162/153244304322972685

Cited by 14 articles.

1. Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews;ACM Transactions on Internet Technology;2016-04-20

2. An Adaptive Web Information Extraction Approach Based on STU-DOM Tree;Applied Mechanics and Materials;2013-09

3. Mining Product Features from the Web: A Self-supervised Approach;Lecture Notes in Business Information Processing;2013

4. A Fast Method for Web Template Extraction via a Multi-sequence Alignment Approach;Communications in Computer and Information Science;2013

5. Web Interface Interpretation Using Graph Grammars;IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews);2012-07