Abstract
AbstractProtein language models (pLMs) trained on large protein sequence databases have been used to understand disease and design novel proteins. In design tasks, the likelihood of a protein sequence under a pLM is often used as a proxy for protein fitness, so it is critical to understand what signals likelihoods capture. In this work we find that pLM likelihoods unintentionally encode a species bias: likelihoods of protein sequences from certain species are systematically higher, independent of the protein in question. We quantify this bias and show that it arises in large part because of unequal species representation in popular protein sequence databases. We further show that the bias can be detrimental for some protein design applications, such as enhancing thermostability. These results highlight the importance of understanding and curating pLM training data to mitigate biases and improve protein design capabilities in under-explored parts of sequence space.
Publisher
Cold Spring Harbor Laboratory
Reference62 articles.
1. Sarah Alamdari , Nitya Thakkar , Rianne van den Berg , Alex Xijie Lu , Nicolo Fusi , Ava Pardis Amini , and Kevin K Yang . Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, pp. 2023–09, 2023.
2. Unified rational protein engineering with sequence-based deep representation learning;Nature methods,2019
3. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
4. David Baker and George Church . Protein design meets biosecurity, 2024.
5. Tolga Bolukbasi , Kai-Wei Chang , James Y Zou , Venkatesh Saligrama , and Adam T Kalai . Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29, 2016.
Cited by
9 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献