Training data composition determines machine learning generalization and biological rule discovery

Published: 2024-06-19

Author:

Ursu Eugen, Minnegalieva Aygul, Rawat Puneet, Chernigovskaya Maria, Tacutu Robi, Sandve Geir Kjetil, Robert Philippe A., Greiff Victor

Abstract

Supervised machine learning models rely on training datasets with positive (target class) and negative examples. Therefore, the composition of the training dataset has a direct influence on model performance. Specifically, negative sample selection bias, concerning samples not representing the target class, presents challenges across a range of domains such as text classification and protein-protein interaction prediction. Machine-learning-based immunotherapeutics design is an increasingly important area of research, focusing on designing antibodies or T-cell receptors (TCRs) that can bind to their target molecules with high specificity and affinity. Given the biomedical importance of immunotherapeutics, there is a need to address the unresolved question of how negative training set composition impacts model generalization and biological rule discovery to enable rational and safe drug design. We set out to study this question in the context of the antibody-antigen binding prediction problem by varying the negative class, encompassing a binding affinity gradient. We based our investigation on large synthetic datasets that provide ground truth structure-based antibody-antigen binding data, allowing access to residue-wise binding energy on the binding interface. We found that both out-of-distribution generalization and binding rule discovery depended on the type of negative dataset used. Importantly, we discovered that a model’s capacity to learn the binding rules of the positive dataset is not a trivial correlate of its classification accuracy. We confirmed our findings with real-world relevant experimental data. Our work highlights the importance of considering training dataset composition for achieving optimal out-of-distribution performance and rule learning in machine-learning-based research.

Significance Statement

The effectiveness of supervised machine learning models hinges on the composition of their training datasets, particularly the inclusion of negative examples. This bias in negative sample selection can greatly impact model performance. As the development of immunotherapeutic agents using machine learning is becoming increasingly crucial in biomedicine, understanding the impact of negative training set composition is imperative. Our study, focused on the antibody-antigen binding prediction problem, reveals that the choice of negative dataset significantly affects both out-of-distribution generalization and binding rule discovery across synthetic and experimental data. These findings underscore the necessity of carefully considering training dataset composition in machine-learning-driven biomedical research for optimal performance, robustness and meaningful rule acquisition.

Publisher

Cold Spring Harbor Laboratory
