On the sparsity of fitness functions and implications for learning

Published: 2021-12-22 Issue: 1 Volume: 119 Page: e2109649118
ISSN: 0027-8424
Container-title: Proceedings of the National Academy of Sciences
Language: en
Short-container-title: Proc Natl Acad Sci USA

Author:

Brookes David H., Aghazadeh Amirali, Listgarten Jennifer

Abstract

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model’s interpretable parameters—sequence length, alphabet size, and assumed interactions between sequence positions—on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.
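The sparsity described in the abstract can be illustrated with a small toy example. The sketch below is not the authors' code: it assumes a binary alphabet, a hypothetical sequence length of 8, and hypothetical adjacent neighborhoods of size 3, and the helper names `sample_nk_fitness` and `walsh_hadamard` are made up for illustration. It samples an NK-style fitness function, expands it in the Walsh-Hadamard basis (one common way to represent epistatic interactions for binary sequences), and counts the nonzero coefficients. In this toy setting, a coefficient can be nonzero only if its set of positions lies entirely inside one neighborhood.

```python
import itertools
import numpy as np

def sample_nk_fitness(L, neighborhoods, rng):
    """Sample a binary NK-style fitness function over all 2**L sequences."""
    seqs = np.array(list(itertools.product([0, 1], repeat=L)))  # shape (2**L, L)
    fitness = np.zeros(len(seqs))
    for hood in neighborhoods:
        # One random contribution per joint state of the neighborhood's positions.
        table = rng.normal(size=2 ** len(hood))
        # Encode each sequence's neighborhood state as an integer table index.
        idx = seqs[:, hood] @ (2 ** np.arange(len(hood)))
        fitness += table[idx]
    return seqs, fitness

def walsh_hadamard(f):
    """Fast Walsh-Hadamard transform, normalized by the number of sequences."""
    a = np.array(f, dtype=float)
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h] = x + y
            a[i + h:i + 2 * h] = x - y
        h *= 2
    return a / len(a)

# Assumed toy parameters: length 8, periodic adjacent neighborhoods of size 3.
L = 8
neighborhoods = [[j, (j + 1) % L, (j + 2) % L] for j in range(L)]
rng = np.random.default_rng(0)
seqs, fitness = sample_nk_fitness(L, neighborhoods, rng)
coeffs = walsh_hadamard(fitness)

# Coefficient k corresponds to the set of positions where seqs[k] equals 1.
support = {
    tuple(int(p) for p in np.flatnonzero(seqs[k]))
    for k in np.flatnonzero(np.abs(coeffs) > 1e-9)
}
# Position sets contained in at least one neighborhood (including the empty set).
expected = {
    tuple(sorted(sub))
    for hood in neighborhoods
    for r in range(len(hood) + 1)
    for sub in itertools.combinations(hood, r)
}
print(len(support), "of", 2 ** L, "coefficients are nonzero")
```

Under these assumptions only a small fraction of the 256 possible coefficients is nonzero, and changing the neighborhood size or layout changes that count. This is the kind of dependence of sparsity on model parameters that the abstract refers to.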

Publisher

Proceedings of the National Academy of Sciences

Subject

Multidisciplinary

Cited by 27 articles.

1. The simplicity of protein sequence-function relationships;Nature Communications;2024-09-11

2. An extension of the Walsh-Hadamard transform to calculate and model epistasis in genetic landscapes of arbitrary shape and complexity;PLOS Computational Biology;2024-05-28

3. Epistasis facilitates functional evolution in an ancient transcription factor;eLife;2024-05-20

4. Symmetry, gauge freedoms, and the interpretability of sequence-function relationships;2024-05-13

5. Gauge fixing for sequence-function relationships;2024-05-13