Abstract
Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. Recent machine learning (ML) models excel at predicting TCR-pMHC binding for sequences seen during training. However, they underperform with peptides outside this distribution, raising concerns about their applicability in therapeutic settings. To address this issue, we evaluate the effect of the distance between training and testing peptide distributions on ML model risk assessments, using sequence-based and 3D structure-based distance metrics. In our analysis, we use two state-of-the-art models for TCR-peptide binding prediction: Attentive Variational Information Bottleneck (AVIB) and NetTCR-2.0. Our hypothesis posited that increased similarity between test and training peptides could lead to inflated estimates of the true generalization error. However, our results indicate that 3D structural similarity metrics, rather than sequence-based metrics, are a better predictor of the model’s generalization performance. These findings highlight the importance of using the 3D structure in benchmarking the performance of TCR-pMHC binding prediction models. Specifically, it helps identify sudden generalization drops as the distance between training and test data distributions increases. Consequently, we recommend using structure-based over sequence-based distance methods for more reliable and accurate evaluations in TCR-pMHC binding studies.
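To make the notion of a sequence-based distance between a test peptide and the training set concrete, the sketch below scores each test peptide by its smallest Levenshtein (edit) distance to any training peptide. This is an illustration under stated assumptions only: the abstract does not say which sequence metric was used, and the function names and example peptides here are hypothetical placeholders, not the authors' code or data.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two peptide sequences (assumed metric)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_to_training_set(test_peptide: str,
                             train_peptides: Iterable[str]) -> int:
    """Smallest edit distance from a test peptide to any training peptide."""
    return min(edit_distance(test_peptide, p) for p in train_peptides)


# Illustrative placeholder peptides, not taken from the paper's datasets.
train = ["GILGFVFTL", "NLVPMVATV"]
for peptide in ["GILGFVFTM", "GLCTLVAML"]:
    print(peptide, distance_to_training_set(peptide, train))
```

Test peptides could then be grouped by this distance and model performance reported per group, to see how the error changes as test peptides move away from the training distribution. A structure-based analogue would swap `edit_distance` for a measure computed on 3D structures.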
Publisher
Cold Spring Harbor Laboratory