Abstract
In this paper, we consider the problem of constructing wrappers for web information extraction that are robust to changes in websites. We consider two models to study robustness formally: the adversarial model, where we look at the worst-case robustness of wrappers, and the probabilistic model, where we look at the expected robustness of wrappers, as web pages evolve. Under both models, we present efficient algorithms for constructing the provably most robust wrapper. By evaluating on real websites, we demonstrate that in practice our algorithms are highly effective in coping with changes in websites, and reduce wrapper breakage by up to 500% over existing techniques.
Subject
General Earth and Planetary Sciences, Water Science and Technology, Geography, Planning and Development
Cited by
9 articles.
1. Learning Transferable Node Representations for Attribute Extraction from Web Documents;Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining;2022-02-11
2. Automatic news-roundup generation using clustering, extraction, and presentation;Multimedia Systems;2019-11-09
3. Robust Web Data Extraction Based on Unsupervised Visual Validation;Intelligent Information and Database Systems;2019
4. Robust and Noise Resistant Wrapper Induction;Proceedings of the 2016 International Conference on Management of Data;2016-06-14
5. Web Content Extraction;ACM SIGKDD Explorations Newsletter;2016-02-25