Cleaning crowdsourced labels using oracles for statistical classification

Published: 2018-12 Issue: 4 Volume: 12 Page: 376-389
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Dolatshah Mohamad¹, Teoh Mathew¹, Wang Jiannan¹, Pei Jian¹

Affiliation:

1. Simon Fraser University

Abstract

Nowadays, crowdsourcing is being widely used to collect training data for solving classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with ground-truth labels. In this paper, we consider how to apply oracle-based label cleaning to reduce the gap. We propose TARS, a label-cleaning advisor that can provide two pieces of valuable advice for data scientists when they need to train or test a model using noisy labels. Firstly, in the model testing stage, given a test dataset with noisy labels, and a classification model, TARS can use the test data to estimate how well the model will perform w.r.t. ground-truth labels. Secondly, in the model training stage, given a training dataset with noisy labels, and a classification algorithm, TARS can determine which label should be sent to an oracle to clean such that the model can be improved the most. For the first advice, we propose an effective estimation technique, and study how to compute confidence intervals to bound its estimation error. For the second advice, we propose a novel cleaning strategy along with two optimization techniques, and illustrate that it is superior to the existing cleaning strategies. We evaluate TARS on both simulated and real-world datasets. The results show that (1) TARS can use noisy test data to accurately estimate a model's true performance for various evaluation metrics; and (2) TARS can improve the model accuracy by a larger margin than the existing cleaning strategies, for the same cleaning budget.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3297753.3297758

Cited by 23 articles.

1. Data cleaning and machine learning: a systematic literature review;Automated Software Engineering;2024-06-11

2. Rock: Cleaning Data by Embedding ML in Logic Rules;Companion of the 2024 International Conference on Management of Data;2024-06-09

3. Recognizing Textual Entailment by Hierarchical Crowdsourcing with Diverse Labor Costs;2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD);2024-05-08

4. Perovskite-based optoelectronic systems for neuromorphic computing;Nano Energy;2024-02

5. MisDetect: Iterative Mislabel Detection using Early Loss;Proceedings of the VLDB Endowment;2024-02