Geographically Biased Composition of NetMHCpan Training Datasets and Evaluation of MHC-Peptide Binding Prediction Accuracy on Novel Alleles
Author:
Atkins Thomas Karl, Solanki Arnav, Vasmatzis George, Cornette James, Riedel Marc
Abstract
Bias in neural network model training datasets has been observed to decrease prediction accuracy for groups underrepresented in training data. Thus, investigating the composition of training datasets used in machine learning models with health-care applications is vital to ensure equity. Two such machine learning models are NetMHCpan-4.1 and NetMHCIIpan-4.0, used to predict antigen binding scores to major histocompatibility complex class I and II molecules, respectively. As antigen presentation is a critical step in mounting the adaptive immune response, previous work has used these or similar prediction models in a broad array of applications, from explaining asymptomatic viral infection to cancer neoantigen prediction. However, these models have also been shown to be biased toward hydrophobic peptides, suggesting the networks could also contain other sources of bias. Here, we report that the composition of the networks’ training datasets is heavily biased toward European Caucasian individuals and against Asian and Pacific Islander individuals. We test the ability of NetMHCpan-4.1 and NetMHCIIpan-4.0 to distinguish true binders from randomly generated peptides on alleles not included in the training datasets. Unexpectedly, we fail to find evidence that the disparities in training data lead to a meaningful difference in prediction quality for alleles not present in the training data. We attempt to explain this result by mapping the HLA sequence space to determine the sequence diversity of the training dataset. Furthermore, we link the residues that have the greatest impact on NetMHCpan predictions to structural features for three alleles (HLA-A*34:01, HLA-C*04:03, HLA-DRB1*12:02).
Publisher
Cold Spring Harbor Laboratory