Selective data acquisition in the wild for model charging

Published:2022-03 Issue:7 Volume:15 Page:1466-1478
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Chai Chengliang¹, Liu Jiabin¹, Tang Nan², Li Guoliang¹, Luo Yuyu¹

Affiliation:

1. Tsinghua University, Beijing, China

2. QCRI, Doha, Qatar

Abstract

The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, data markets, and so on), the problem is to select labeled data points from the data in the wild as additional train data that can help the ML task. It consists of two steps (Fig. 1). The first step is to discover relevant datasets (e.g., tables with similar relational schema), which will result in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all data points they contain can help. The second step is to select which data points from these candidate datasets should be used. We build an end-to-end solution. For step 1, we piggyback off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called AutoData. It first clusters all data points from candidate datasets such that each cluster contains similar data points from different sources. It then iteratively picks which cluster to use, samples data points (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit based solution and a Deep Q Networks-based reinforcement learning solution. Experiments using both relational and image datasets show the effectiveness of our solutions.
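
The iterative loop described in the abstract (pick a cluster, sample a mini-batch, evaluate it, learn from the reward) can be illustrated with a minimal multi-armed bandit sketch. This is not the AutoData implementation: it assumes a UCB1 selection rule, treats each cluster as one arm, keeps every sampled mini-batch, and relies on a hypothetical `evaluate` callback that returns a numeric reward. The names `ucb1_select` and `acquire` are invented for illustration.

```python
import math
import random


def ucb1_select(arms, counts, totals, step, c=math.sqrt(2)):
    """Pick one arm (cluster index) from `arms` using the UCB1 rule."""
    # Try every available arm once before trusting the confidence bound.
    for arm in arms:
        if counts[arm] == 0:
            return arm

    def score(arm):
        mean = totals[arm] / counts[arm]
        return mean + c * math.sqrt(math.log(step) / counts[arm])

    return max(arms, key=score)


def acquire(clusters, evaluate, rounds, batch_size, seed=0):
    """Iteratively pick a cluster, sample a mini-batch, and record its reward.

    clusters: list of lists of data points, one list per cluster.
    evaluate: hypothetical callback (selected_so_far, batch) -> float reward.
    Returns the acquired points and the number of times each cluster was picked.
    """
    rng = random.Random(seed)
    pools = [list(cluster) for cluster in clusters]  # do not mutate the input
    for pool in pools:
        rng.shuffle(pool)  # popping from a shuffled pool samples without replacement
    counts = [0] * len(pools)
    totals = [0.0] * len(pools)
    selected = []
    for step in range(1, rounds + 1):
        arms = [i for i, pool in enumerate(pools) if pool]
        if not arms:  # every cluster is exhausted
            break
        arm = ucb1_select(arms, counts, totals, step)
        size = min(batch_size, len(pools[arm]))
        batch = [pools[arm].pop() for _ in range(size)]
        reward = evaluate(selected, batch)  # feedback for the picked cluster
        counts[arm] += 1
        totals[arm] += reward
        selected.extend(batch)  # simplification: keep every sampled batch
    return selected, counts


if __name__ == "__main__":
    # Toy data: points are (cluster_id, usefulness); cluster 0 is the helpful one.
    toy_clusters = [[(0, 1.0)] * 100, [(1, 0.0)] * 100]

    def toy_reward(selected_so_far, batch):
        return sum(value for _, value in batch) / len(batch)

    points, picks = acquire(toy_clusters, toy_reward, rounds=20, batch_size=2)
    print(picks)
```

In a real setting, the reward would more plausibly come from retraining or fine-tuning the model with the mini-batch and measuring the change on a validation set. The toy reward above only stands in for that step.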

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3523210.3523223

References: 55 articles.

1. Model selection for ecologists: the worldviews of AIC and BIC

2. Peter Auer. 2000. Using Upper Confidence Bounds for Online Learning. In FOCS. 270--279.

3. Active Sampling for Entity Matching with Guarantees

4. Bing API. 2022. https://docs.microsoft.com/en-us/. Accessed: 2022-03-14.

5. Chengliang Chai, Lei Cao, Guoliang Li, Jian Li, Yuyu Luo, and Samuel Madden. 2020. Human-in-the-loop Outlier Detection. In SIGMOD Conference 2020. 19--33.

Cited by 29 articles.

1. Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search;Proceedings of the VLDB Endowment;2024-07

2. HAIChart: Human and AI Paired Visualization System;Proceedings of the VLDB Endowment;2024-07

3. PLUTUS: Understanding Data Distribution Tailoring for Machine Learning;Companion of the 2024 International Conference on Management of Data;2024-06-09

4. CoInsight: Visual Storytelling for Hierarchical Tables With Connected Insights;IEEE Transactions on Visualization and Computer Graphics;2024-06

5. Data Acquisition for Improving Model Confidence;Proceedings of the ACM on Management of Data;2024-05-29