Snorkel

Published:2017-11 Issue:3 Volume:11 Page:269-282
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Ratner Alexander¹, Bach Stephen H.¹, Ehrenberg Henry¹, Fries Jason¹, Wu Sen¹, Ré Christopher¹

Affiliation:

1. Stanford University

Abstract

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8X faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8X speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
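As a rough illustration of the labeling-function idea described in the abstract, the sketch below writes a few heuristics that each vote on an example or abstain, then combines their votes. The function names, the example heuristics, and the combiner are hypothetical and not taken from the paper. In particular, Snorkel fits a generative model that estimates the unknown accuracies and correlations of the labeling functions; this sketch substitutes a plain unweighted majority vote, which is only a simple stand-in for that step.

```python
from collections import Counter
from typing import Callable, List

# Label values used by this sketch (an assumed convention, not the paper's).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

LabelingFunction = Callable[[str], int]


def lf_mentions_causes(text: str) -> int:
    """Hypothetical heuristic: the word 'causes' suggests a positive relation."""
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_negation(text: str) -> int:
    """Hypothetical heuristic: an explicit negation suggests a negative label."""
    return NEGATIVE if "does not" in text.lower() else ABSTAIN


def lf_short_sentence(text: str) -> int:
    """Hypothetical heuristic: very short sentences are treated as negative."""
    return NEGATIVE if len(text.split()) < 4 else ABSTAIN


def majority_vote(text: str, lfs: List[LabelingFunction]) -> int:
    """Combine labeling-function votes by unweighted majority.

    Returns ABSTAIN when every function abstains or when the top vote is tied.
    This is a simplification, not the denoising model described in the paper.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with no clear winner
    return ranked[0][0]


LFS: List[LabelingFunction] = [lf_mentions_causes, lf_negation, lf_short_sentence]
```

Calling `majority_vote(sentence, LFS)` on each unlabeled sentence yields a noisy training label or an abstention. An unweighted vote treats every heuristic as equally reliable, which is exactly the assumption that a learned model of labeling-function accuracies is meant to relax.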

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3157794.3157797

Cited by 352 articles.

1. KGRED: Knowledge-graph-based rule discovery for weakly supervised data labeling;Information Processing & Management;2024-09

2. A Weak Supervision-Based Approach to Improve Chatbots for Code Repositories;Proceedings of the ACM on Software Engineering;2024-07-12

3. Rethinking Software Engineering in the Era of Foundation Models: A Curated Catalogue of Challenges in the Development of Trustworthy FMware;Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering;2024-07-10

4. Understanding the impact of climate change on critical infrastructure through NLP analysis of scientific literature;Sustainable and Resilient Infrastructure;2024-07-02

5. Weakly supervised classification through manifold learning and rank-based contextual measures;Neurocomputing;2024-07