CERES

Published: 2018-06 Issue: 10 Volume: 11 Page: 1084-1096
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Lockard Colin¹, Dong Xin Luna², Einolghozati Arash³, Shiralkar Prashant²

Affiliation:

1. University of Washington

2. Amazon

3. Facebook

Abstract

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a website and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3231751.3231758

Cited by 35 articles.

1. EDDVPL: A Web Attribute Extraction Method with Prompt Learning;Communications in Computer and Information Science;2023-11-27

2. Knowledge graph–enabled tolerancing experience acquisition and reuse for tolerance specification;The International Journal of Advanced Manufacturing Technology;2023-11-22

3. Table Discovery in Data Lakes: State-of-the-art and Future Directions;Companion of the 2023 International Conference on Management of Data;2023-06-04

4. Pre-training language model incorporating domain-specific heterogeneous knowledge into a unified representation;Expert Systems with Applications;2023-04

5. DOM2R-Graph: A Web Attribute Extraction Architecture with Relation-Aware Heterogeneous Graph Transformer;Neural Information Processing;2023