Abstract
Machine learning models in bioinformatics are often trained and used within the scope of a single project, but some models are also reused across projects and deployed in translational settings. Over time, trained models may turn out to be maladjusted to the properties of new data. This creates the need to improve their performance under various constraints. This work explores correcting models without retraining from scratch and without accessing the original training data. It uses a taxonomy of strategies to guide the development of a software package, ‘mlensemble’. Key features include joining heterogeneous models into ensembles and calibrating ensembles to the properties of new data. These are well-established techniques but are often hidden within more complex tools. By exposing them at the application level, the package enables analysts to use expert knowledge to adjust models whenever needed. Calculations with imaging data show benefits when the noise characteristics of the training and application datasets differ. An example using genomic single-cell data demonstrates model portability despite batch effects. The generality of the framework makes it applicable in other subject domains as well.
Publisher
Cold Spring Harbor Laboratory