Abstract
Protein language models (pLMs) are ubiquitous across biological machine learning research, but state-of-the-art models like ESM2 take hundreds of thousands of GPU hours to pre-train on the vast protein universe. Resource requirements for scaling up pLMs prevent fundamental investigations into how optimal modeling choices might differ from those used in natural language. Here, we define a “cramming” challenge for pLMs and train performant models in 24 hours on a single GPU. By re-examining many aspects of pLM training, we are able to train a 67 million parameter model in a single day that achieves comparable performance on downstream protein fitness landscape inference tasks to ESM-3B, a model trained for over 15,000× more GPU hours than ours. We open source our library for training and inference, LBSTER: Language models for Biological Sequence Transformation and Evolutionary Representation.
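The central constraint described above is a fixed wall-clock and hardware budget for masked-language-model pre-training. The sketch below is a minimal, illustrative PyTorch loop for that kind of time-budgeted, single-GPU masked pre-training on protein sequences; the toy model, tokenizer, masking routine, and random data are assumptions chosen for clarity and are not the paper's LBSTER implementation or the ESM2 architecture.

# Illustrative sketch only: single-GPU, wall-clock-budgeted masked-LM training on
# protein sequences, in the spirit of the "cramming" setup. All sizes and data are toy
# placeholders (assumptions), not the paper's configuration.
import time
import torch
import torch.nn as nn

# Toy amino-acid vocabulary plus special tokens (assumption, not the paper's tokenizer).
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
VOCAB = ["<pad>", "<mask>"] + AMINO_ACIDS
STOI = {tok: i for i, tok in enumerate(VOCAB)}
PAD_ID, MASK_ID = STOI["<pad>"], STOI["<mask>"]

class TinyProteinMLM(nn.Module):
    """A small transformer encoder with a token-prediction head (BERT/ESM-style objective)."""
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style corruption: replace random non-pad positions with <mask>;
    loss is computed only on those positions (labels elsewhere set to -100)."""
    labels = tokens.clone()
    mask = (torch.rand_like(tokens, dtype=torch.float) < mask_prob) & (tokens != PAD_ID)
    labels[~mask] = -100  # ignored by cross-entropy
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    return corrupted, labels

def train_with_budget(model, batches, budget_seconds=24 * 3600, lr=1e-3):
    """Train until the wall-clock budget is exhausted (the 'cramming' constraint)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
    start = time.time()
    while time.time() - start < budget_seconds:
        for tokens in batches:
            if time.time() - start >= budget_seconds:
                break
            tokens = tokens.to(device)
            inputs, labels = mask_tokens(tokens)
            logits = model(inputs)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Usage with random toy batches (stand-ins for real protein sequence data):
if __name__ == "__main__":
    batches = [torch.randint(2, len(VOCAB), (8, 128)) for _ in range(10)]
    model = TinyProteinMLM(len(VOCAB))
    train_with_budget(model, batches, budget_seconds=60)  # 60 s here; 24 h in the paper's setup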
Publisher
Cold Spring Harbor Laboratory