Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages

Published:2023-07 Issue:11 Volume:16 Page:3098-3110
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Sarkhel Ritesh¹, Huang Binxuan¹, Lockard Colin¹, Shiralkar Prashant¹

Affiliation:

1. Amazon, Seattle, Washington

Abstract

Information Extraction (IE) from semi-structured web-pages is a long-studied problem. Training a model for this extraction task requires a large number of human-labeled samples. Prior works have proposed transferable models to improve the label-efficiency of this training process. The extraction performance of transferable models, however, depends on the size of their fine-tuning corpus. This holds true for large language models (LLMs) such as GPT-3 as well. Generalist models like LLMs need to be fine-tuned on in-domain, human-labeled samples for competitive performance on this extraction task. Constructing a large-scale fine-tuning corpus with human-labeled samples, however, requires significant effort. In this paper, we develop a Label-Efficient Self-Training Algorithm (LEAST) to improve the label-efficiency of this fine-tuning process. Our contributions are two-fold. First, we develop a generative model that facilitates the construction of a large-scale fine-tuning corpus with minimal human effort. Second, to ensure that the extraction performance does not suffer due to noisy training samples in our fine-tuning corpus, we develop an uncertainty-aware training strategy. Experiments on two publicly available datasets show that LEAST generalizes to multiple verticals and backbone models. Using LEAST, we can train models with fewer than ten human-labeled pages from each website, outperforming strong baselines while reducing the number of human-labeled training samples needed for comparable performance by up to 11x.
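
The abstract does not spell out how LEAST works internally, so the following is only a generic sketch of self-training with confidence-weighted pseudo-labels, not the authors' algorithm. It assumes a toy nearest-centroid classifier in place of a fine-tuned backbone model, and every name in it (`Example`, `fit_centroids`, `predict`, `self_train`, the `threshold` and `rounds` parameters) is hypothetical. The idea illustrated is the general one the abstract points at: a small human-labeled seed set is expanded with model-labeled samples, and each such sample is down-weighted by the model's confidence so that noisy labels influence training less.

```python
from dataclasses import dataclass
from math import exp


@dataclass
class Example:
    features: tuple
    label: str
    weight: float = 1.0  # 1.0 for human labels, model confidence for pseudo-labels


def fit_centroids(examples):
    """Compute the weighted mean feature vector for each label."""
    sums, totals = {}, {}
    for ex in examples:
        acc = sums.setdefault(ex.label, [0.0] * len(ex.features))
        for i, value in enumerate(ex.features):
            acc[i] += ex.weight * value
        totals[ex.label] = totals.get(ex.label, 0.0) + ex.weight
    return {lab: tuple(v / totals[lab] for v in acc) for lab, acc in sums.items()}


def predict(centroids, features):
    """Return (label, confidence); confidence is a softmax over negative squared distances."""
    dists = {
        lab: sum((a - b) ** 2 for a, b in zip(features, c))
        for lab, c in centroids.items()
    }
    d_min = min(dists.values())  # shift by the minimum for numerical stability
    scores = {lab: exp(-(d - d_min)) for lab, d in dists.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def self_train(labeled, unlabeled, threshold=0.9, rounds=3):
    """Grow the training set with confident pseudo-labels, weighted by confidence."""
    train = list(labeled)
    pool = list(unlabeled)
    centroids = fit_centroids(train)
    for _ in range(rounds):
        remaining = []
        for features in pool:
            label, conf = predict(centroids, features)
            if conf >= threshold:
                # Uncertain predictions are excluded; accepted ones carry their confidence.
                train.append(Example(features, label, weight=conf))
            else:
                remaining.append(features)
        if len(remaining) == len(pool):
            break  # nothing new was accepted, so further rounds would not change anything
        pool = remaining
        centroids = fit_centroids(train)
    return centroids, train, pool
```

In this sketch, samples whose confidence stays below `threshold` are left in the unlabeled pool rather than forced into a class, which is one simple way to keep ambiguous samples out of the training set.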

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3611479.3611511

References: 54 articles.

1. Large language models are few-shot clinical information extractors

2. Eleuther AI. 2021. The GPT-Neo 1.3B model. Accessed: 2023-04-05.

3. Massih-Reza Amini, Vasilii Feofanov, Loic Pauletto, Emilie Devijver, and Yury Maximov. 2022. Self-training: A survey. arXiv preprint arXiv:2202.12040 (2022).

4. Wrapper approaches for web data extraction: A review

5. Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews