Snuba

Published:2018-11 Issue:3 Volume:12 Page:223-236
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Varma Paroma¹, Ré Christopher¹

Affiliation:

1. Stanford University

Abstract

As deep learning models are applied to increasingly diverse problems, a key bottleneck is gathering enough high-quality training labels tailored to each task. Users therefore turn to weak supervision, relying on imperfect sources of labels like pattern matching and user-defined heuristics. Unfortunately, users have to design these sources for each task. This process can be time-consuming and expensive: domain experts often perform repetitive steps like guessing optimal numerical thresholds and developing informative text patterns. To address these challenges, we present Snuba, a system to automatically generate heuristics using a small labeled dataset to assign training labels to a large, unlabeled dataset in the weak supervision setting. Snuba generates heuristics that each label the subset of the data they are accurate for, and iteratively repeats this process until the heuristics together label a large portion of the unlabeled data. We develop a statistical measure that guarantees the iterative process will automatically terminate before it degrades training label quality. Snuba automatically generates heuristics in under five minutes and performs up to 9.74 F1 points better than the best known user-defined heuristics developed over many days. In collaborations with users at research labs, Stanford Hospital, and on open source datasets, Snuba outperforms other automated approaches like semi-supervised learning by up to 14.35 F1 points.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3291264.3291268

Cited by 72 articles.

1. KGRED: Knowledge-graph-based rule discovery for weakly supervised data labeling;Information Processing & Management;2024-09

2. Automating Weak Label Generation for Data Programming with Clinicians in the Loop;2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE);2024-06-19

3. Curating Naturally Adversarial Datasets for Learning-Enabled Medical Cyber-Physical Systems;2024 ACM/IEEE 15th International Conference on Cyber-Physical Systems (ICCPS);2024-05-13

4. Language Models in the Loop: Incorporating Prompting into Weak Supervision;ACM / IMS Journal of Data Science;2024-04-08

5. Early detection of fake news on emerging topics through weak supervision;Journal of Intelligent Information Systems;2024-03-15