Affiliations:
1. York University, Toronto, Ontario, Canada
2. University of Toronto, Toronto, Ontario, Canada
Abstract
In recent years, there has been a growing recognition that high-quality training data is crucial to the performance of machine learning models. This awareness has catalyzed both research and industrial initiatives dedicated to data acquisition as a means of improving various dimensions of model performance. Among these dimensions, model confidence is of paramount importance; however, it has often been overlooked in prior work on data acquisition. To address this gap, our work focuses on improving the data acquisition process with the goal of enhancing the confidence of machine learning models. Specifically, we operate within a practical setting where only a limited number of samples can be obtained from a large data pool. We build on well-established model confidence metrics and propose two methodologies, Bulk Acquisition (BA) and Sequential Acquisition (SA), each geared towards identifying the sets of samples that yield the largest gains in model confidence. Recognizing the complexity of BA and SA, we introduce two efficient approximate methods, kNN-BA and kNN-SA, which restrict data acquisition to promising subsets of the data pool. To broaden the applicability of our solutions, we introduce a Distribution-based Acquisition approach that makes minimal assumptions about the data pool and facilitates data acquisition across various settings. Through extensive experiments covering diverse datasets, models, and parameter configurations, we demonstrate the efficacy of our proposed methods across a range of tasks. Comparative experiments against applicable baselines show that our proposed approaches perform better.
Publisher
Association for Computing Machinery (ACM)