Affiliations:
1. RMIT University
2. The University of Wollongong
3. The University of Queensland
4. Data61, CSIRO
Abstract
In this paper, we study how to acquire labeled data points from a large data pool to enrich a training set and thereby improve supervised machine learning (ML) performance. The state-of-the-art solution is the clustering-based training set selection (CTS) algorithm, which first clusters the data points in a data pool and then selects new data points from the clusters. The efficiency of CTS is constrained by its frequent retraining of the target ML model, and its effectiveness is limited by its selection criteria, which represent the state of the data points within each cluster and restrict selection to a single cluster in each iteration. To overcome these limitations, we propose a new algorithm, called CTS with incremental estimation of adaptive score (IAS). IAS employs online learning to update the model incrementally with new data, eliminating the need to fully retrain the target model and thus improving efficiency. To enhance the effectiveness of IAS, we introduce adaptive score estimation, a novel selection criterion that identifies clusters and selects new data points by balancing the trade-off between exploitation and exploration during data acquisition. To further enhance effectiveness, we introduce a new adaptive mini-batch selection method that, in each iteration, selects data points from multiple clusters rather than a single cluster, eliminating the potential bias of relying on only one cluster. Integrating this method into IAS yields a novel algorithm, termed IAS with adaptive mini-batch selection (IAS-AMS). Experimental results highlight the superior effectiveness of IAS-AMS, with IAS also outperforming the other competing algorithms. In terms of efficiency, IAS takes the lead, while IAS-AMS is on par with the existing CTS algorithm.
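The abstract describes the algorithmic ingredients only at a high level. Below is a minimal, hypothetical sketch of such an acquisition loop; it is not the authors' IAS/IAS-AMS implementation. The UCB-style score, the softmax mini-batch weighting across clusters, the proxy reward signal, and all names (acquire, oracle_label) are illustrative assumptions. Only the overall structure follows the abstract: cluster the pool once, score clusters adaptively, select a mini-batch spanning multiple clusters, and update the model online instead of fully retraining.

```python
# Hypothetical sketch (NOT the paper's implementation) of a clustering-based
# data-acquisition loop with a UCB-style adaptive score and online
# (incremental) model updates. The scoring formula, reward signal, and
# mini-batch policy are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDClassifier

def acquire(pool_X, oracle_label, X_init, y_init,
            n_clusters=10, batch_size=16, rounds=20, c=1.0):
    # Online learner: partial_fit updates the model incrementally,
    # avoiding a full retrain after each acquisition round.
    model = SGDClassifier(loss="log_loss", random_state=0)
    model.partial_fit(X_init, y_init, classes=np.unique(y_init))

    # Cluster the unlabeled pool once, up front.
    clusters = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(pool_X)
    pulls = np.ones(n_clusters)    # how often each cluster was sampled
    gain = np.zeros(n_clusters)    # running estimate of each cluster's usefulness

    unlabeled = set(range(len(pool_X)))
    for t in range(1, rounds + 1):
        # Adaptive score: exploit clusters that helped before (gain),
        # explore under-sampled clusters (UCB-style bonus).
        score = gain + c * np.sqrt(np.log(t + 1.0) / pulls)

        # Adaptive mini-batch: draw from MULTIPLE clusters, weighting
        # clusters by a softmax over their scores instead of picking one.
        w = np.exp(score - score.max())
        w /= w.sum()
        picks = []
        for k in np.random.choice(n_clusters, size=batch_size, p=w):
            cand = [i for i in unlabeled if clusters[i] == k]
            if cand:
                i = int(np.random.choice(cand))
                picks.append(i)
                unlabeled.discard(i)
                pulls[k] += 1
        if not picks:
            break

        # Acquire labels for the selected points and update incrementally.
        X_new = pool_X[picks]
        y_new = np.array([oracle_label(i) for i in picks])
        before = model.score(X_new, y_new)   # crude proxy for improvement
        model.partial_fit(X_new, y_new)
        after = model.score(X_new, y_new)
        for k in np.unique(clusters[picks]):
            gain[k] = 0.9 * gain[k] + 0.1 * (after - before)
    return model
```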
Publisher
Association for Computing Machinery (ACM)