Accuracy and data efficiency in deep learning models of protein expression-Reference-Cited by-同舟云学术

Accuracy and data efficiency in deep learning models of protein expression

Published:2022-12-15 Issue:1 Volume:13 Page:
ISSN:2041-1723
Container-title:Nature Communications
language:en
Short-container-title:Nat Commun

Author:

Nikolados Evangelos-Marios,Wongprommoon Arin,Aodha Oisin Mac,Cambray Guillaume,Oyarzún Diego A.^ORCID

Abstract

AbstractSynthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency and employed Explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.

Funder

Darwin Trust of Edinburgh

Publisher

Springer Science and Business Media LLC

Subject

General Physics and Astronomy,General Biochemistry, Genetics and Molecular Biology,General Chemistry,Multidisciplinary

Link

https://www.nature.com/articles/s41467-022-34902-5.pdf

Reference53 articles.

1. Terpe, K. Overview of bacterial expression systems for heterologous protein production: from molecular and biochemical fundamentals to commercial systems. Appl. Microbiol. Biotechnol. 72, 211–222 (2006).

2. Sørensen, H. P. & Mortensen, K. K. Advanced genetic strategies for recombinant protein expression in Escherichia coli. J. Biotechnol. 115, 113–128 (2005).

3. Blazeck, J. & Alper, H. S. Promoter engineering: recent advances in controlling transcription at the most fundamental level. Biotechnol. J. 8, 46–58 (2013).