Abstract
AbstractUnderstanding the linkage between protein sequence and phenotypic expression level is crucial in biotechnology. Machine learning algorithms trained with deep mutational scanning (DMS) data have significant potential to improve this understanding and accelerate protein engineering campaigns. However, most machine learning (ML) approaches in this domain do not directly address effects of synonymous codons or positional epistasis on predicted expression levels. Here we used yeast surface display, deep mutational scanning, and next-generation DNA sequencing to quantify the expression fitness landscape of human myoglobin and train ML models to predict epistasis of double codon mutants. When fed with near comprehensive single mutant DMS data, our algorithm computed expression fitness values for double codon mutants using ML-predicted epistasis as an intermediate parameter. We next deployed this predictive model to screen > 3·106unseen double codon mutantsin silicoand experimentally tested highly ranked candidate sequences, finding 14 of 16 with significantly enhanced expression levels. Our experimental DMS dataset combined with codon level epistasis-based ML constitutes an effective method for bootstrapping fitness predictions of high order mutational variants using experimental data from variants of lower order.
Publisher
Cold Spring Harbor Laboratory