Genotype sampling for deep-learning assisted experimental mapping of fitness landscapes

Published: 2024-01-23

Author:

Wagner Andreas

Abstract

Motivation: Experimental characterization of fitness landscapes, which map genotypes onto fitness, is important for both evolutionary biology and protein engineering. It faces a fundamental obstacle in the astronomical number of genotypes whose fitness needs to be measured for any one protein. Deep learning may help to predict the fitness of many genotypes from a smaller neural network training sample of genotypes with experimentally measured fitness. Here I use a recently published experimentally mapped fitness landscape of more than 260,000 protein genotypes to ask how such sampling is best performed.

Results: I show that multilayer perceptrons, recurrent neural networks (RNNs), convolutional networks, and transformers can explain more than 90 percent of fitness variance in the data. In addition, 90 percent of this performance is reached with a training sample comprising merely ≈10^3 sequences. Generalization to unseen test data is best when training data is sampled randomly and uniformly, or sampled to minimize the number of synonymous sequences. In contrast, sampling to maximize sequence diversity or codon usage bias reduces performance substantially. These observations hold for more than one network architecture. Simple sampling strategies may perform best when training deep learning neural networks to map fitness landscapes from experimental data.
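
The sampling-and-evaluation idea in the abstract can be illustrated with a minimal sketch. This is not the author's pipeline. The landscape is synthetic and additive, the sequence length of 9 nucleotides and the total of 5,000 genotypes are arbitrary assumptions, and a least-squares linear model stands in for the deep networks. The sketch shows only two steps: drawing a training sample of 10^3 genotypes uniformly at random, and scoring held-out predictions by the fraction of fitness variance explained (R^2). All function names are hypothetical.

```python
import numpy as np

def one_hot(seqs, alphabet="ACGT"):
    """Encode equal-length sequences as flat one-hot vectors."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    out = np.zeros((len(seqs), len(seqs[0]) * len(alphabet)))
    for row, seq in enumerate(seqs):
        for pos, ch in enumerate(seq):
            out[row, pos * len(alphabet) + index[ch]] = 1.0
    return out

def uniform_split(n_total, n_train, rng):
    """Draw training indices uniformly at random, without replacement."""
    perm = rng.permutation(n_total)
    return perm[:n_train], perm[n_train:]

def variance_explained(y_true, y_pred):
    """Fraction of variance explained (coefficient of determination R^2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)

# Synthetic stand-in for a measured landscape (assumption: additive
# per-site effects plus small measurement noise).
seq_len, n_genotypes = 9, 5000
seqs = ["".join(rng.choice(list("ACGT"), size=seq_len)) for _ in range(n_genotypes)]
X = one_hot(seqs)
site_effects = rng.normal(size=X.shape[1])
fitness = X @ site_effects + rng.normal(scale=0.1, size=n_genotypes)

# Uniform random training sample of 10^3 genotypes; the rest is test data.
train_idx, test_idx = uniform_split(n_genotypes, 1000, rng)

# Least-squares linear model as a simple stand-in for a neural network.
coef, *_ = np.linalg.lstsq(X[train_idx], fitness[train_idx], rcond=None)
r2 = variance_explained(fitness[test_idx], X[test_idx] @ coef)
print(f"held-out R^2: {r2:.3f}")
```

Alternative sampling strategies, such as those that maximize sequence diversity, would replace `uniform_split` with a different index-selection rule while leaving the evaluation unchanged, which is what makes strategies comparable on the same held-out data.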

Publisher

Cold Spring Harbor Laboratory
